Category Archives: Databases

Matters relating to data management and storage.

Logging footfall counts with a Raspberry Pi and camera – results dashboard

Here at Cranfield University we are putting in place plans related to the new ‘Living Laboratory’ project, part of our ‘Urban Observatory’. This project sits within the wider UKCRIC initiative, across a number of universities. Of the many experiments in development, we are exploring machine vision as a means to provide footfall counts of pedestrian traffic in parts of the campus. This blog builds on an earlier blog summarising some of the technical considerations relating to this work, and shows how a simple dashboard can be developed with Node.js and the templating engine pug.

In the earlier blog, we captured sensor data from Raspberry Pi Zeros with cameras, running Kerberos, to a Postgres database. This quickly builds up a vast body of data in the database. Finding a way to present this data in a web-based dashboard will help us investigate and evaluate the experiment results.

We investigated a range of tools for presenting the data in this way. The project is already using Node.js to receive postings from the cameras using HTTP POST requests. The Node.js environment then communicates with the back-end Postgres database to INSERT new records. The same Node.js environment can also be used to serve up the results of queries made of the database in response to standard HTTP GET requests.

The Postgres database design has the following structure:

id integer NOT NULL DEFAULT [we used the data type SERIAL to auto-number records, and set this field as the Primary Key]
"regionCoordinates" character varying(30)
"numberOfChanges" integer
incoming integer
outgoing integer
"timestamp" character varying(30)
microseconds character varying(30)
token integer
"instanceName" character varying(30)

An SQL query to extract summary data for a simple results table is, for example:

SELECT database."instanceName" AS location,
  COUNT(id) AS count,
  SUM(database."numberOfChanges") as changes
FROM database
GROUP BY database."instanceName"

Note the use of the double quotation marks around names having mixed case (e.g. "instanceName"). Using pgAdmin to access Postgres, this query produces an output as follows:

Postgres query showing data summary

So with this query and data output, we have the data we need for a simple report. Next we need to build a web page using Node.js. We are already using Express, the lightweight web server. However, a powerful addition is a JavaScript HTML templating engine. Of the many on offer, we like pug: it is a high-performance template engine, implemented in JavaScript for Node.js and browsers, and has a clean and compact approach. We first need to install pug on the server where Node.js is present.

# The node app should be stopped
sudo systemctl stop node-api-postgres.service

# Update npm itself if required
sudo npm install -g npm

# Install the JavaScript template engine 'pug'
npm i pug

# We may then need to rebuild the app dependencies
npm rebuild

With pug installed, we can update the Node.js scripts built in the earlier blog. First index.js:

index.js

#!/usr/bin/env node
// index.js
const express = require('express')
const bodyParser = require('body-parser')
const pug = require('pug')
const app = express()
const db = require('./queries')
const port = 3000

app.use(bodyParser.json())
app.use(
  bodyParser.urlencoded({
    extended: true,
  })
)
app.set('views','views');
app.set('view engine', 'pug');

app.get('/', (request, response) => {
  response.send('Footfall counter - Service running')
  console.log('Footfall counter - Service enquiry received')
})

app.post('/counter', db.createFootfall)
app.get('/stats', db.getStats)

app.listen(port, () => {
  console.log(`App running on port ${port}.`)
})

Note in index.js the app settings. app.set('views', 'views') tells Express where the pug templates will be stored for generating the web page. app.set('view engine', 'pug') tells Express to use pug to generate the content.

Note also the new REST endpoint ‘/stats’, introduced with an HTTP GET request (app.get) and associated with the function getStats in the accompanying queries.js module (file).


queries.js

// queries.js

const Pool = require('pg').Pool
const pool = new Pool({
   user: '<<username>>',
   host: '<<hostname>>',
   database: '<<database>>',
   password: '<<password>>',
   port: 5432,
})

const createFootfall = (request, response) => {
  const {regionCoordinates, numberOfChanges, incoming, outgoing, timestamp, microseconds, token, instanceName} = request.body
  pool.query('INSERT INTO <<tablename>> ("regionCoordinates", "numberOfChanges", "incoming", "outgoing", "timestamp", "microseconds", "token", "instanceName") VALUES ($1, $2, $3, $4, $5, $6, $7, $8) RETURNING *', [regionCoordinates, numberOfChanges, incoming, outgoing, timestamp, microseconds, token, instanceName], (error, result) => {
    if (error) {
      console.log(error.stack)
      return response.status(500).send('Error adding footfall\n')
    }
    console.log(`Footfall added with the ID: ${result.rows[0].id}`)
    response.status(200).send(`Footfall added with ID: ${result.rows[0].id}\n`)
  })
}

const getStats = (request, response) => {
  pool.query('SELECT <<tablename>>."instanceName" AS location, COUNT(id) AS count, SUM(<<tablename>>."numberOfChanges") as changes FROM <<tablename>> GROUP BY <<tablename>>."instanceName"', (error, result) => {
    if (error) {
      throw error
    }
    response.status(200).render('stats', {title: 'Footfall counter statistical reporter', rows: result.rows})
  })
}

module.exports = {
  createFootfall,
  getStats,
}

Note in queries.js the definition of the function getStats. This function is associated with the HTTP GET request on ‘/stats’ from the earlier index.js file. The function returns an HTTP status of 200 and then calls the pug render function with two parameters: first the title of the resultant web page we want, and then, more importantly, the recordset (‘rows’) resulting from the SQL query – the entire object is passed through. Lastly, getStats is added to the module exports at the end of the file.

Next we have to set up the pug template and style sheet used to generate the HTML file output. Note we defined the folder ‘views’ to hold these files in ‘index.js’. We create a new folder called ‘views’ and create a pug template file called ‘stats.pug’. Pug files all have the extension ‘.pug’.


stats.pug

doctype html
html(lang='en')
  head
    meta(charset='utf-8')
    title #{title}
    style
      include stats.css
  body
    div#header
      h1 #{title}
      ul#minitabs
        li
          a(href='#') Stats 1
        li
          a(href='#') Stats 2
    p Statistical summary of the Footfall counter experiment
    div#content
      table
        thead
          tr
            th Camera Location
            th Number of postings
            th Count of instances
        tbody
          each row in rows
            tr
              td #{row.location}
              td #{row.count}
              td #{row.changes}
    p Above is a summary of the footfall counts recorded to date from the start of the experiment
    div#footer
      h1 Living Laboratory - Urban Observatory

The pug template has a particular markup format that takes a bit of getting used to, but it does result in a clean document ready for rendering down into HTML in response to the HTTP request. Note the way the parameters are integrated into the page: firstly the page title, and then the expression of the recordset as a variable-length table, using the ‘each’ construct to iterate over the recordset. Note lastly the use of the ‘include’ directive to pull the CSS file into the HTML. The file ‘stats.css’ looks like this:

stats.css

/* Example CSS Document - Screen version */

/* Screen partitions */
#header {
  padding: 5px;
  }

#content {
  float: right;
  padding: 5px;
  width: 100%;
  }

#footer {
  clear: right;
  padding: 5px;
  }

/***************************/

/* Top Menu definition */
#minitabs {
  margin: 0;
  padding: 0 0 20px 0;
  font-family: Arial, sans-serif;
  font-size: 15px;
  border-bottom: 1px solid #ffcc00;
  }

#minitabs li {
  margin: 0;
  padding: 0;
  display: inline;
  list-style-type: none;
  }

#minitabs a {
  float: right;
  line-height: 14px;
  font-weight: bold;
  margin: 0 10px 4px 10px;
  text-decoration: none;
  color: #ffcc00;
  }

#minitabs a.active, #minitabs a:hover {
  border-bottom: 4px solid #696;
  padding-bottom: 2px;
  color: #363;
  }

/***************************/

/* Table definitions */

table {
  border-collapse: collapse;
  }
  
tbody {
  color: #999;
  } 
  
th, td {
  border: 1px solid #999;
  padding: 10px;
  }

caption, th {
  font-family: Verdana, sans-serif;
  font-size: 15px;
  font-weight: bold;
  padding: 10px;
  background-color: #696;
  color: white;
  }

/***************************/

/* General text decorations */

/* Drop capital - decorative effect */
.drop {
  float: left;
  font-family: Verdana, sans-serif;
  font-size: 450%;
  line-height: 1em;
  margin: 4px 10px 10px 0;
  padding: 4px 10px;
  border: 2px solid #ccc;
  background: #eee;
  }

/* Inset box */
.inset_box {
  float: right;
  font-family: Arial, sans-serif;
  font-weight: Bold;
  color: #999;
  margin: 10px 5px -2.5px 10px;
  padding: 5px;
  border: 2px solid #ccc;
  background: #eee;
  width: 30%;
  }

/* General body text definition */
body {
  font-family: Georgia, Times, serif;
  line-height: 1.3em;
  font-size: 15px;
  text-align: justify;
  display: block;
  }

.noprint {
  font-size: 20px;
  color: "#FF3333";
  }

/* All link colours */
a:link, a:visited, a:hover, a:active {
  color: #ffcc00;
  }

/* Headings */
h1 {
  font-family: Arial, sans-serif;
  font-size: 24px;
  font-variant: small-caps;
  letter-spacing: 4px;
  color: #ffffdd;
  padding-top: 4px;
  padding-bottom: 4px;
  background-color: #ffcc00;  
  }  

h2 {
  font-family: Arial, sans-serif;
  font-size: 14px;
  font-style: Italic;
  color: #ffff99;
  padding: 0;
  background-color: #ffcc00;  
  }  

/* Horizontal rule */
hr {
  display: inline;
  color: #ffff99;
  }

/* Abbreviations &amp; Acronyms */
abbr, acronym {
  border-bottom: 1px dotted;
  font-weight: Bold;
  cursor: help;
  color: #ffcc00;
  }

Not all the definitions in the CSS are used, but it is a useful set of styles. The file ‘stats.css’ is placed alongside ‘stats.pug’ in the ‘views’ folder.

At this point the Node.js app can be restarted

# The node app should be started
sudo systemctl start node-api-postgres.service

The app is up and running again, and hopefully continuing to receive data from the cameras via the HTTP POST requests arriving at the end-point ‘/counter’. Now, however, we can also point our web browser at the end-point ‘/stats’, and we should see the following simple dashboard report.

http://<<IP_ADDRESS>>:3000/stats
pug app, up and running, styled by CSS
the pug code

Epilogue

The blog here has described taking data from a Postgres database via a custom SELECT statement and using Node.js to pass the data to the JavaScript templating engine pug to prepare a simple HTML dashboard.

Future work could consider improved graphics, perhaps drawing on the graphics library D3 and its availability for Node.js via npm.

Perl vs Python

Here at Cranfield University we work a lot with data in our GIS and data-related teaching and research. A common challenge is transferring a complex dataset from one format into another to make it usable. Often there are tools we can use to help in that manipulation, both proprietary and open source. For the spatial datasets we often work with, we can use the range of data converters in ArcGIS and QGIS, we can use the fantastic ‘Feature Manipulation Engine’ (FME) from Safe Software, or its manifestation in ArcGIS – the Data Interoperability extension – and then again we can look to libraries such as the Geospatial Data Abstraction Library (GDAL) for scripted functionality. As ever in computing, there are many ways of achieving our objectives.

However, sometimes there is nothing for it but to hack away in a favourite programming scripting language to make the conversion. Traditionally we used the wonderfully eclectic ‘Perl‘ language (pathologically eclectic rubbish lister – look it up!!) More recently the emphasis has perhaps shifted to Python as the language of choice. Certainly, if we are asked by our students which general purpose programming language to use for data manipulation, we advise Python is the one to have experience with on the CV.

If we have a simple data challenge, for example, we might want to convert an ASCII text file with data in one format into another format and write it out to a new file. We might want to go, say, from a file in this format (in ‘input.csv’):

AL1 1AG,1039499.00,0
AL1 1AG,383009.76,10251
etc......

To this format (in ‘output.csv’) …

UK,Item 1,R,AL1 1AG,,,,,,,,,,1039499.00,0,,,,
UK,Item 2,R,AL1 1AG,,,,,,,,,,383009.76,10251,,,,
etc......

For this Perl is a great solution – integrating the strengths of awk and sed. Perl can produce code which quickly chomps through huge data files. One has to be careful as to how the code is developed, to ensure its readability. Sometimes, coming back to a piece of code, one can struggle for a while to remember how it works – and this is especially so where the code is highly compacted.

#!/usr/bin/env perl
# Call as 'perl script.pl <in_file> > <out_file>'
# e.g. perl script.pl input.csv > output.csv
use Text::CSV;
my $csv = Text::CSV->new({sep_char => ',' });
$j=1;
while (<>) {
  chomp;
  if ($csv->parse($_)) {
    my @fields = $csv->fields();
    printf("UK,Item %d,R,%s,,,,,,,,,,%s,%s,,,,\n",$j++,@fields[0],@fields[1],@fields[2]);
  }       
}

The equivalent task in Python is equally simple, and perhaps a little more readable…

#!/usr/bin/env python
# python3 code
# Call as 'python3 script.py'
import csv
o = open('output.csv','w')
with open('input.csv', 'r') as f:
   reader = csv.reader(f)
   mylist = list(reader)
j = 0
for row in mylist:
   j+=1
   o.write('UK,Item {:d},R,{:s},,,,,,,,,,{:s},{:s},,,,\n'.format(j, row[0], row[1], row[2]))
o.close()

Note the code above is Python3 not Python2. Like Perl (with cpan), Python is extensible (with pip) – and in fact one really needs to use extensions (modules, or imported libraries) to get the most out of it (and to help prevent you needing to reinvent the wheel and introducing unnecessary errors). There is no need to write lots of code for handling CSV files for example – the csv library above does this very efficiently in Python. Likewise, if say we want to write data back out to JSON (JavaScript Object Notation format), again the json library can come to the rescue:

import csv
import json

# read the CSV rows into a list
with open('/folderlocation/input.csv', newline='', encoding='utf-8-sig') as csvfile:
    reader = csv.reader(csvfile)
    rows = list(reader)

for row in rows:
    print(', '.join(row))

# write the rows back out as a single JSON array
with open('/folderlocation/output.json', 'w') as jsonfile:
    json.dump(rows, jsonfile)

There is probably not really a lot in the difference between the two languages – it all rather depends on one’s preferences. However, for GIS professionals, Python expertise is a must, as it is the scripting language of choice in ArcGIS (in fact it is even shipped with ArcGIS). Other alternatives exist of course for these sorts of tasks – ‘R’ is one that comes to mind – again being equally extensible.

Using SQLite on a Raspberry Pi

There is a lot of interest in the amazing Raspberry Pi 3 computer here at Cranfield University. Sometimes, an application we build – for example on the Raspberry Pi – needs to store data – for example we might wish to store data from sensors. For this we need a database. However, again sometimes the overhead of having a full-blown database server up and running is too much. We need a simple, localised way to store and retrieve data. Fortunately there is a great solution to this, ‘SQLite’ (https://www.sqlite.org).

This brief tutorial shows how to set up and configure SQLite on a Raspberry Pi, and gives an example of it in use.

To start with, as with every time we use the Pi, we should update and upgrade the distribution – to ensure all the software within the Debian distribution is up to date. As described on the Raspberry Pi pages (https://www.raspberrypi.org/documentation/raspbian/updating.md), this is pretty straightforward.

First, update your system’s package list by entering the following command:

sudo apt-get update

Next, upgrade all your installed packages to their latest versions with the command:

sudo apt-get dist-upgrade

Once this is done, we are ready to install SQLite:

sudo apt-get install sqlite3

We will use the database to store data from within programming code – for example a software application that stores off sensor data. However, in the first instance, we can also run SQLite interactively to ‘create’, ‘read’, ‘update’ and ‘delete’ data directly in SQL tables. SQLite comes with a command line interpreter (CLI), that provides a command prompt to enter in commands. If we create a new database, we can try this out:

sqlite3 sensor_db

This starts up SQLite and will, at the same time, also create a new database – a single file, in this case called ‘sensor_db’ in your current folder.

Now we can create a database structure in our new database:

CREATE TABLE IF NOT EXISTS readings (
'temperature' REAL NOT NULL,
'humidity' REAL NOT NULL,
'datetime_int' INTEGER NOT NULL,
'sensor_id' INTEGER NOT NULL);

Note the semi-colon ending the statement. Now we can check that was created OK:

PRAGMA table_info([readings]);

Let’s place a couple of dummy data items (rows) into the table:

INSERT INTO readings (temperature, humidity, datetime_int, sensor_id) values (18.5,45.3,strftime('%s','now'),1);
INSERT INTO readings (temperature, humidity, datetime_int, sensor_id) values (19.4,42.8,strftime('%s','now'),1);

Now select the data to make sure it was stored correctly. Note we are using one of the two means of storing dates, here using integers not strings (see http://www.sqlitetutorial.net/sqlite-date/).

SELECT temperature, humidity, datetime(datetime_int, 'unixepoch'), sensor_id FROM readings;

The data should be shown…

To quit the interactive mode:

.exit
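
In practice, of course, the same operations will be driven from code rather than the interactive prompt. As an illustration only, a minimal sketch using Python’s built-in sqlite3 module (assuming the ‘sensor_db’ file and ‘readings’ table created above) might look like this:

#!/usr/bin/env python3
# Sketch: store and retrieve sensor readings in SQLite from Python
import sqlite3
import time

# Open (or create) the database file created above
conn = sqlite3.connect('sensor_db')
cur = conn.cursor()

# Insert a dummy reading, storing the timestamp as a Unix epoch integer
cur.execute('INSERT INTO readings (temperature, humidity, datetime_int, sensor_id) VALUES (?, ?, ?, ?)',
            (18.5, 45.3, int(time.time()), 1))
conn.commit()

# Read the data back, converting the integer timestamp for display
for row in cur.execute("SELECT temperature, humidity, datetime(datetime_int, 'unixepoch'), sensor_id FROM readings"):
    print(row)

conn.close()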

SQLite is a powerful database solution for small applications. Programming it is just like its larger server-based equivalents. It is worth spending time reading the SQLite tutorials for further information (http://www.sqlitetutorial.net). Another good tutorial is https://www.tutorialspoint.com/sqlite/index.htm.

Exploring traffic times data

A recent investigation here at Cranfield University considered the sources of road journey traffic time data, and this blog recounts some of that investigation. First of all comes the sources of the data.

Highways England Data

Thanks to the fantastic open data revolution we now have a huge wealth of public data available via the www.data.gov.uk portal. Here for example we can source data on traffic times from the Highways England agency. Their traffic times data can be obtained from https://data.gov.uk/dataset/dft-eng-srn-routes-journey-times

This data series provides average journey time, speed and traffic flow information for 15-minute periods since April 2009 on all motorways and ‘A’ roads managed by the Highways Agency – known as the Strategic Road Network – in England. Journey times and speeds are estimated using a combination of sources, including Automatic Number Plate Recognition (ANPR) cameras, in-vehicle Global Positioning Systems (GPS) and inductive loops built into the road surface.

For example, we downloaded the CSV file ‘Feb15.csv‘, relating to February 2015 data. The first data record, shown here field by field, reads:

LinkRef: AL215
Link Description: A120 between A133 and A1232 (AL215)
Date: 2015-02-10 00:00:00
Time Period: 67
AverageJT: 305.47
Average Speed: 105.12
Data Quality: 1
Link Length: 8.9200000762939453
Flow: 286.50

This line of data relates to a stretch of road north of Colchester, UK on the A120. The key information here is that on 10th February 2015, for this c.9km stretch of road, the average journey time was 305.47 seconds (c.5.1 mins). The time of day is given as 67. This number is one of 96 15-minute intervals in the day that the data refers to (0-95, where 0 indicates 00:00 to 00:15); 67 is therefore 4:45:00 PM to 5:00:00 PM (see the useful table at the end of this article for working this out).
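
For reference, converting between the period number and the clock time is simple arithmetic; a small illustrative Python snippet (not part of the Highways England tooling) might be:

# Convert a Highways England 15-minute period number (0-95) into a time range
def period_to_range(period):
    start = period * 15              # minutes after midnight
    end = start + 15
    return '{:02d}:{:02d} to {:02d}:{:02d}'.format(start // 60, start % 60, (end // 60) % 24, end % 60)

print(period_to_range(67))           # prints '16:45 to 17:00', i.e. 4:45 PM to 5:00 PM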

Google Traffic Data

Another useful source of data is from Google. The Google routing and traffic functions can be used by making a call to the Google ‘Distance Matrix’ API, described here:

https://developers.google.com/maps/documentation/distance-matrix/intro#traffic-model

Using the excellent ‘Postman‘ tool, we can formulate and test a REST call to the Google Distance Matrix API.

https://maps.googleapis.com/maps/api/distancematrix/json?units=metric&origins=enc:{s{{H{ovD:&destinations=enc:g{t{HqtiE:&departure_time=now&traffic_model=best_guess&key=<API KEY>

Parameters for this API are as follows:
units = metric values (e.g. km)
origins = starting point (encoded)
destinations = finish point (encoded)
departure_time = cannot be historical; the keyword ‘now’ means the current time
traffic_model = best_guess (rather than optimistic/pessimistic)
key = your personal API key

The parameters origins and destinations hold locations in latitude and longitude. As an alternative to decimal degree values, encoded polyline values can be used in the URL. To encode locations, the polyline utility can be used: see https://developers.google.com/maps/documentation/utilities/polylineutility
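
The same call can also be scripted rather than sent from Postman. As a rough sketch (using the Python requests library; the encoded origin/destination values and API key below are placeholders to substitute):

import requests

# Placeholders: substitute real encoded polylines (or 'lat,lng' strings) and your own API key
params = {
    'units': 'metric',
    'origins': 'enc:<<ENCODED_ORIGIN>>:',
    'destinations': 'enc:<<ENCODED_DESTINATION>>:',
    'departure_time': 'now',
    'traffic_model': 'best_guess',
    'key': '<<API_KEY>>',
}
response = requests.get('https://maps.googleapis.com/maps/api/distancematrix/json', params=params)
element = response.json()['rows'][0]['elements'][0]
print(element['duration']['value'], element['duration_in_traffic']['value'])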

The resultant response to this REST call, made using Postman to send the query (GET), is:

{
    "destination_addresses": [
        "A120, Colchester CO7, UK"
    ],
    "origin_addresses": [
        "A120, Ardleigh, Colchester CO7, UK"
    ],
    "rows": [
        {
            "elements": [
                {
                    "distance": {
                        "text": "8.0 km",
                        "value": 7993
                    },
                    "duration": {
                        "text": "5 mins",
                        "value": 278
                    },
                    "duration_in_traffic": {
                        "text": "5 mins",
                        "value": 301
                    },
                    "status": "OK"
                }
            ]
        }
    ],
    "status": "OK"
}

Key information here being that at the time of making the call (‘now’), for this c.8km stretch of road, it took between 278 and 301 seconds (c.4.6 to 5.0 mins) to drive. Key to this is the difference between the ‘duration’ and ‘duration_in_traffic’ values. Google notes that this allows you to ‘receive a route and trip duration (response field: duration_in_traffic) that take traffic conditions into account’. Note that ‘the departure_time must be set to the current time or some time in the future. It cannot be in the past’.

So in this way the Google approach allows a definition of the delays in drive time caused by traffic conditions. Although this cannot be determined retrospectively, a speculative future date can be selected whereby a prediction is made based on previous traffic conditions.


Utilities

The table used to calculate the time period for the Highways England data, described above:

Period From To
0 12:00:00 AM 12:15:00 AM
1 12:15:00 AM 12:30:00 AM
2 12:30:00 AM 12:45:00 AM
3 12:45:00 AM 1:00:00 AM
4 1:00:00 AM 1:15:00 AM
5 1:15:00 AM 1:30:00 AM
6 1:30:00 AM 1:45:00 AM
7 1:45:00 AM 2:00:00 AM
8 2:00:00 AM 2:15:00 AM
9 2:15:00 AM 2:30:00 AM
10 2:30:00 AM 2:45:00 AM
11 2:45:00 AM 3:00:00 AM
12 3:00:00 AM 3:15:00 AM
13 3:15:00 AM 3:30:00 AM
14 3:30:00 AM 3:45:00 AM
15 3:45:00 AM 4:00:00 AM
16 4:00:00 AM 4:15:00 AM
17 4:15:00 AM 4:30:00 AM
18 4:30:00 AM 4:45:00 AM
19 4:45:00 AM 5:00:00 AM
20 5:00:00 AM 5:15:00 AM
21 5:15:00 AM 5:30:00 AM
22 5:30:00 AM 5:45:00 AM
23 5:45:00 AM 6:00:00 AM
24 6:00:00 AM 6:15:00 AM
25 6:15:00 AM 6:30:00 AM
26 6:30:00 AM 6:45:00 AM
27 6:45:00 AM 7:00:00 AM
28 7:00:00 AM 7:15:00 AM
29 7:15:00 AM 7:30:00 AM
30 7:30:00 AM 7:45:00 AM
31 7:45:00 AM 8:00:00 AM
32 8:00:00 AM 8:15:00 AM
33 8:15:00 AM 8:30:00 AM
34 8:30:00 AM 8:45:00 AM
35 8:45:00 AM 9:00:00 AM
36 9:00:00 AM 9:15:00 AM
37 9:15:00 AM 9:30:00 AM
38 9:30:00 AM 9:45:00 AM
39 9:45:00 AM 10:00:00 AM
40 10:00:00 AM 10:15:00 AM
41 10:15:00 AM 10:30:00 AM
42 10:30:00 AM 10:45:00 AM
43 10:45:00 AM 11:00:00 AM
44 11:00:00 AM 11:15:00 AM
45 11:15:00 AM 11:30:00 AM
46 11:30:00 AM 11:45:00 AM
47 11:45:00 AM 12:00:00 PM
48 12:00:00 PM 12:15:00 PM
49 12:15:00 PM 12:30:00 PM
50 12:30:00 PM 12:45:00 PM
51 12:45:00 PM 1:00:00 PM
52 1:00:00 PM 1:15:00 PM
53 1:15:00 PM 1:30:00 PM
54 1:30:00 PM 1:45:00 PM
55 1:45:00 PM 2:00:00 PM
56 2:00:00 PM 2:15:00 PM
57 2:15:00 PM 2:30:00 PM
58 2:30:00 PM 2:45:00 PM
59 2:45:00 PM 3:00:00 PM
60 3:00:00 PM 3:15:00 PM
61 3:15:00 PM 3:30:00 PM
62 3:30:00 PM 3:45:00 PM
63 3:45:00 PM 4:00:00 PM
64 4:00:00 PM 4:15:00 PM
65 4:15:00 PM 4:30:00 PM
66 4:30:00 PM 4:45:00 PM
67 4:45:00 PM 5:00:00 PM
68 5:00:00 PM 5:15:00 PM
69 5:15:00 PM 5:30:00 PM
70 5:30:00 PM 5:45:00 PM
71 5:45:00 PM 6:00:00 PM
72 6:00:00 PM 6:15:00 PM
73 6:15:00 PM 6:30:00 PM
74 6:30:00 PM 6:45:00 PM
75 6:45:00 PM 7:00:00 PM
76 7:00:00 PM 7:15:00 PM
77 7:15:00 PM 7:30:00 PM
78 7:30:00 PM 7:45:00 PM
79 7:45:00 PM 8:00:00 PM
80 8:00:00 PM 8:15:00 PM
81 8:15:00 PM 8:30:00 PM
82 8:30:00 PM 8:45:00 PM
83 8:45:00 PM 9:00:00 PM
84 9:00:00 PM 9:15:00 PM
85 9:15:00 PM 9:30:00 PM
86 9:30:00 PM 9:45:00 PM
87 9:45:00 PM 10:00:00 PM
88 10:00:00 PM 10:15:00 PM
89 10:15:00 PM 10:30:00 PM
90 10:30:00 PM 10:45:00 PM
91 10:45:00 PM 11:00:00 PM
92 11:00:00 PM 11:15:00 PM
93 11:15:00 PM 11:30:00 PM
94 11:30:00 PM 11:45:00 PM
95 11:45:00 PM 12:00:00 AM

Corrupt Word docx document

We all use Word for report writing and general word processing, but arghhhh, what to do when the file becomes corrupted!

One of the students here at Cranfield University, suffered a recent misfortune of corrupting a MS Word docx file. The file had been 30+ pages of closely written text, ready for a thesis meeting, when disaster struck. Somehow the document became corrupted and opened as a blank document with no text. Inspecting it, we realised the document size was still 1.5Mb – so the text was probably still in the file – even if we couldn’t see it.

We tried all manner of tricks to cajole Word into opening the file and recovering the precious text, all to no avail. At a point, frankly, of some desperation, we remembered that the docx file format is a zipped collection of XML files. This was a saving grace – the earlier ‘doc’ format was just a proprietary binary format, but the ‘docx’ format offered some hope.

We took a copy of the file, and renamed it ‘document.zip’. This allowed us to open the zip archive, and to see the contents. Immediately we see the hierarchical structure of the document and the multiple files it contains.
MS Word XML file structure

We could then open the folder ‘word’, which showed the principal contents of the document.
MS Word XML file structure

Straight away, we can see the sub-folder ‘media’ – this folder contained all the images that had been in the original document. Great – those were saved off. Now we needed to extract the text itself.

There is also the key file ‘document.xml’ – XML is a software- and hardware-independent format for storing and transporting data (http://www.w3schools.com/xml/xml_whatis.asp), stored in plain text. We extracted this file and loaded it into our favourite text editor (TextWrangler – for the Mac). Inspecting the XML file shows the usual structures of XML – all spun onto a single line, thus:
MS Word XML file structure

We could then hunt through the file and locate the text in the XML, noting the tags within which the text was recorded. In this case, we can see text tagged with <w:t>.
MS Word XML file structure

So now we needed an automated method to extract all the text from the document – to start with, no such tool is present on the Mac. However, thanks to Kevin Peck’s excellent blog here (http://kevsaidwhat.blogspot.co.uk/2013/03/other-mac-things-i-have-learned.html) we found that the software xml_grep did exactly what we wanted. This uses the module XML::Twig for Perl (http://www.xmltwig.org/xmltwig/). Perl is a fantastic scripting language – well worth learning, and excellent for file manipulation.

As Kevin notes, the tool was swiftly installed in the Mac thus:

cd XML-Twig-3.50 (use latest version downloaded here)
perl Makefile.PL -y
make
make test
sudo make install

Once the tool was built and working, we could run the extraction we wanted, thus:

$> xml_grep --text_only --cond 'w:t' document.xml > extractedtext.txt
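
As an aside, much the same extraction could be scripted directly in Python, without renaming and unzipping the file by hand – a rough sketch (assuming the corrupted file is saved as ‘document.docx’) might be:

import zipfile
import xml.etree.ElementTree as ET

# A docx file is a zip archive; read the main document part directly
with zipfile.ZipFile('document.docx') as docx:
    xml_content = docx.read('word/document.xml')

# The text runs live in <w:t> elements in the WordprocessingML namespace
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
root = ET.fromstring(xml_content)
text = ''.join(node.text or '' for node in root.iter(W_NS + 't'))

with open('extractedtext.txt', 'w', encoding='utf-8') as out:
    out.write(text)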

Running the extraction produced a file holding the text of the document – which at least allowed our student to carry on with their work, albeit that the report needed reconstructing. It was also interesting to see how the Office files are held in zipped XML format. In any case – phew! The learning point in all this, of course, is to BACK UP YOUR FILES!! This has, of course, been said many times 😉