Monthly Archives: September 2016

Corrupt Word docx document

We all use Word for report writing and general word processing, but arghhhh, what to do when the file becomes corrupted!

One of the students here at Cranfield University recently had the misfortune of corrupting an MS Word docx file. The file held 30+ pages of closely written text, ready for a thesis meeting, when disaster struck. Somehow the document became corrupted and opened as a blank document with no text. Inspecting it, we realised the file size was still 1.5 MB, so the text was probably still in the file, even if we couldn’t see it.

We tried all manner of tricks to cajole Word into opening the file and recovering the precious text, all to no avail. At a point, frankly, of some desperation, we remembered that the docx file format is a zipped set of XML files. This was the saving grace: the earlier ‘doc’ format was a proprietary binary format, whereas the ‘docx’ format offered some hope.

We took a copy of the file and renamed it with a ‘.zip’ extension. This allowed us to open the zip archive and see the contents. Immediately we could see the hierarchical structure of the document and the multiple files it contains.
MS Word XML file structure
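
The copy-and-rename step can also be done from a terminal. The following is a minimal sketch rather than a record of what we typed: the file name ‘thesis.docx’ and the helper name ‘copy_as_zip’ are made up for illustration, and the usage lines assume the ‘unzip’ utility is installed.

```shell
# Copy a .docx to a sibling file with a .zip extension, leaving the original untouched.
# Prints the path of the copy so it can be passed straight to unzip.
# (copy_as_zip is a hypothetical helper name, not part of any tool.)
copy_as_zip() {
    src=$1
    dest="${src%.docx}.zip"
    cp -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage (thesis.docx is a hypothetical file name):
#   zipfile=$(copy_as_zip thesis.docx)
#   unzip -l "$zipfile"               # list the files inside the archive
#   unzip "$zipfile" -d thesis_xml    # unpack everything into a new folder
```

Working on a copy matters here: whatever goes wrong during the rescue attempt, the damaged original is still available for another try.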

We could then open the folder ‘word’, which showed the principal contents of the document.

Straight away, we could see the sub-folder ‘media’, which contained all the images that had been in the original document. Great – those were saved off. Now we needed to extract the text itself.

There is also the key file ‘document.xml’. XML is a software- and hardware-independent tool for storing and transporting data, stored in plain text. We extracted this file and loaded it into our favourite text editor (TextWrangler, for the Mac). Inspecting the file showed the usual XML structures, all spun onto a single line.

We could then hunt through the file and locate the text in the XML, noting the tags within which the text was recorded. In this case, we can see text tagged with <w:t>.

So now we needed an automated method to extract all the text from the document. To start with, no such tool is present on the Mac. However, thanks to Kevin Peck’s excellent blog, we found that the tool xml_grep did exactly what we wanted. It uses the XML::Twig module for Perl. Perl is a fantastic scripting language, well worth learning and excellent for file manipulation.

As Kevin notes, the tool was swiftly installed on the Mac thus:

cd XML-Twig-3.50   # use the latest version you have downloaded
perl Makefile.PL -y
make test
sudo make install

Once the tool was built and working, we could run the extraction we wanted, thus:

$> xml_grep --text_only --cond 'w:t' document.xml > extractedtext.txt

This produced a file holding the text of the document, which at least allowed our student to carry on with their work, although the report still needed reconstructing. It was also interesting to see how Office files are held in a zipped XML format. In any case – phew! The learning point in all this, of course, is to BACK UP YOUR FILES!! This has of course been said many times 😉
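
Where installing XML::Twig is not an option, a rough fallback is possible with grep and sed alone. This is not the method we used, and it is pattern matching rather than real XML parsing, so treat it as a sketch under assumptions: it assumes each text run sits in a ‘w:t’ element with no nested tags, and that your grep supports the ‘-o’ option (GNU and BSD grep both do). The function name ‘extract_wt_text’ is ours for illustration.

```shell
# Print the text of every <w:t> element read from standard input, one text run per line.
# The pattern accepts <w:t> and <w:t xml:space="preserve"> but not <w:tab/>, <w:tbl> etc.
extract_wt_text() {
    grep -o '<w:t[ >][^<]*</w:t>' |
        sed -e 's/<[^>]*>//g' \
            -e 's/&lt;/</g' -e 's/&gt;/>/g' \
            -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
            -e 's/&amp;/\&/g'
}

# Usage, reading document.xml as extracted from the archive:
#   extract_wt_text < document.xml > extractedtext.txt
```

The first sed expression strips the tags, and the remaining ones decode the five predefined XML entities, with ‘&amp;’ handled last so that already-decoded text is not decoded twice. Paragraph breaks are not reconstructed; each run of text simply lands on its own line.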

ESRI Insights for ArcGIS – 2

ArcGIS Enterprise: Installing ArcGIS Portal 10.5 Beta


In the next part of our blog, the GIS team here at Cranfield University are setting up the newly launched ESRI ArcGIS Insights app. We are documenting the process of installing and testing this software, adding some commentary along the way that should hopefully help others tread the same path!

To get ESRI Insights for ArcGIS to work, there are a number of pre-requisites, which we will be installing step by step. These include:

  • ArcGIS Server 10.5 Beta
  • Portal for ArcGIS
  • ArcGIS Web Adaptor
  • ArcGIS Data Store
  • An instance of MS SQLServer Database
  • A JDBC 4.0 Compliant driver

Having installed ArcGIS Server 10.5 Beta successfully, we now need to install the Beta edition of ArcGIS 10.5 Portal. Portal for ArcGIS allows you to share maps, applications, and other geographic information with others, with the shared content being delivered through a website, which can then be customised as required.

ArcGIS Portal 10.5 Beta

To get going with Portal, we copied over the tarfile ‘Portal_for_ArcGIS_Linux_105_beta1.tar.gz’ from the early adopter site; this contains the Portal installation files. The archive includes a folder called ‘Documentation’, within which is a set of HTML instructions. We found it useful to extract these files to a separate computer for consultation as the process unfolded.

From the earlier ArcGIS 10.5 Beta Server installation, we had already downloaded to our installation folder (our home directory on the test server) the sample provisioning authorisation file provided ‘ArcGISforServerAdvancedEnterprise_Server_105alpha.prvc’. We had edited the header for this file with our details, but apart from that we left the codes alone.

The next step was to unpack the installation media, thus:
~> gunzip Portal_for_ArcGIS_Linux_105_beta1.tar.gz
~> tar -xvf Portal_for_ArcGIS_Linux_105_beta1.tar
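
The two unpacking commands can be combined, since tar can handle the gzip decompression itself through its ‘-z’ option. A small sketch follows; the helper name ‘unpack_tgz’ and the optional target directory are our own additions for illustration, not part of the ESRI instructions.

```shell
# Unpack a .tar.gz archive in one step into a target directory
# (defaults to the current directory). The original archive is left in place.
unpack_tgz() {
    archive=$1
    target=${2:-.}
    mkdir -p "$target" && tar -xzf "$archive" -C "$target"
}

# Usage:
#   unpack_tgz Portal_for_ArcGIS_Linux_105_beta1.tar.gz
```

Unlike a separate gunzip, this keeps the compressed download on disk, which is handy if the installation has to be repeated.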

This creates a folder ‘PortalForArcGIS’ with all the installation media in it ready to go. From here, we ran the setup programme, thus:
~> cd PortalForArcGIS
~> ./Setup -l yes -m console

Note the ‘-m console’ option, which runs the installation in console mode.

This seemed to run OK, but we noticed a warning flagged up.

Portal for ArcGIS 10.5 Diagnostic Tool
DIAG000: Check for installation as root [PASSED]
DIAG001: Check for 64-bit architecture [PASSED]
DIAG002: Check OS version [PASSED]
DIAG003: Check hostname for invalid characters [PASSED]
DIAG005: Check system limits [PASSED]
DIAG004: Check installed packages [WARNING]
DIAG016: Check Portal for ArcGIS ports [PASSED]
DIAG024: Check localhost resolution [PASSED]
There were 0 failure(s) and 1 warning(s) found:
*** DIAG004: The following required packages were not found:

‘dos2unix’ huh? OK, so we will clearly need that too. The old PC/Unix text file end of line issue – this is nothing to do with ESRI’s software. Dos2Unix is just a useful utility to help move text files between the various end of line formats.

We installed the dos2unix tool by compiling the source code and installing it manually, and tested that it was working OK, but we continued to see the same warning message from Portal. Puzzling. Then we read the warning again: it says ‘package’ not found. So the installer is looking at the package manager inventory to see what is installed. We re-installed dos2unix with the package manager ‘yum’.

~> yum install dos2unix
Note that ‘yum install’ is the Red Hat equivalent of ‘apt-get install’.
That all worked fine and the Portal check passed OK. The installation of Portal now continued. The terms and conditions are shown for review, along with the default locations (home/arcgis); we accepted these.
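
The distinction that caught us out, a tool being on the PATH versus being known to the package manager, can be checked directly. The sketch below is illustrative only: ‘report_tool’ is a made-up helper name, and it assumes the package carries the same name as the command, which holds for dos2unix but not for every tool.

```shell
# Report whether a tool is runnable from the PATH and, separately,
# whether a package of the same name is registered in the RPM database.
# (report_tool is a hypothetical helper; package name == command name is an assumption.)
report_tool() {
    tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found on PATH"
    else
        echo "$tool: not on PATH"
    fi
    if command -v rpm >/dev/null 2>&1; then
        if rpm -q "$tool" >/dev/null 2>&1; then
            echo "$tool: registered in the RPM database"
        else
            echo "$tool: not registered in the RPM database"
        fi
    else
        echo "rpm not available; cannot query the package database"
    fi
}

# Usage:
#   report_tool dos2unix
```

A copy compiled from source would show up under the first check but not the second, which matches the behaviour we saw from the Portal diagnostic.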

Portal Authorisation

As part of the installation, we provided the explicit path to the provisioning file made available as part of the beta download, ‘PortalforArcGISServer_105alpha.prvc’. This was not accepted, so we needed to try again using the authorisation script ‘authorizeSoftware’ provided in the installation. Remembering that we had the same issue when installing Server, and how we fixed it, we first edited the provisioning file header with our own details, saving it to a new file ‘PortalforArcGISServer_105alpha_edited.prvc’.

Note that ‘authorizeSoftware -f’ allows a provisioning file to be applied to an existing installation, thus:
~/arcgis/portal/tools> ./authorizeSoftware -f /home/gisadmin/PortalforArcGISServer_105alpha_edited.prvc

~/arcgis/portal/tools> ./authorizeSoftware -s
Starting the ArcGIS Software Authorization Wizard

Run this script with -h for additional information.
Product Ver ECP# Expires
portal_1000 101 ecp906762680 30-jan-2017

[shows all software licensed]

So far so good, the software is all installed and licensed. Now time to fire up Portal for the first time:

We entered the URL of the server:
(substitute ‘localhost’ for the fully qualified URL of your server)

At first this address did not work. We realised that, once again, the firewall needed updating for this port number (7443), which it was blocking by default. A few quick edits in ‘lokkit’ later and we were up and running.

~> sudo lokkit --port=7443:tcp --update

ArcGIS Portal

As a new install, we selected ‘Create new Portal’, then completed the form presented.


Web Adaptor

ESRI say ‘ArcGIS Web Adaptor is a required component of Portal for ArcGIS which allows you to integrate your portal with your existing web server and your organization’s security mechanisms’, noting that ‘the ArcGIS Web Adaptor (Java Platform) on Linux allows you to integrate your existing Java-based web server with ArcGIS Server and Portal for ArcGIS.’ So we need to install Web Adaptor too.

The tar.gz file for Web Adaptor was downloaded and unpacked in the same way as the earlier archive.

~> gunzip Web_Adaptor_Java_Linux_105_beta1.tar.gz
~> tar -xvf ./Web_Adaptor_Java_Linux_105_beta1.tar

This creates a folder ‘WebAdaptor’ with all the installation media in it ready to go. From here, we ran the setup programme, thus:
~> cd WebAdaptor
~> ./Setup -l yes -m silent
[ArcGIS 10.5 Web Adaptor (Java Platform) Installation Details]
UI Mode..................silent
Agreed to Esri License...yes
Installation Directory.../home/USER/arcgis/webadaptor10.5
Starting installation of ArcGIS 10.5 Web Adaptor (Java Platform)...
...ArcGIS 10.5 Web Adaptor (Java Platform) installation is complete.

Note the ‘-m silent’ setting used in the setup, which selects silent mode.

So at the end of this blog, we now have ArcGIS Server, ArcGIS Portal and the Web Adaptor all up and running.

Lastly, we can make sure all the ports we need are opened correctly through the firewall.
~> sudo service iptables status
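
Rather than reading the listing by eye, the check for a particular port can be scripted. This is a sketch under an assumption about the output format: it expects rule lines of the usual iptables form, containing ‘ACCEPT’ and ‘dpt:<port>’, and it does not understand port ranges (‘dpts:’). The function name ‘port_accepted’ is ours for illustration.

```shell
# Succeeds if the firewall listing on standard input contains an ACCEPT rule
# for the given destination port. Assumes iptables-style lines with "dpt:<port>".
port_accepted() {
    port=$1
    grep -Eq "ACCEPT.*dpt:${port}([^0-9]|\$)"
}

# Usage:
#   sudo service iptables status | port_accepted 7443 && echo "port 7443 is open"
```

The trailing part of the pattern stops a search for, say, 744 from matching a rule for 7443.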

In the next blog, we will continue installing the other pre-requisite software tools for ESRI Insights for ArcGIS.

Thanks for reading!