Category Archives: Big Data

Matters relating to the handling of large datasets

Exploring traffic times data

A recent investigation here at Cranfield University considered the sources of road journey traffic time data, and this blog recounts some of that investigation, starting with the sources of the data themselves.

Highways England Data

Thanks to the fantastic open data revolution we now have a huge wealth of public data available via the www.data.gov.uk portal. Here for example we can source data on traffic times from the Highways England agency. Their traffic times data can be obtained from https://data.gov.uk/dataset/dft-eng-srn-routes-journey-times

This data series provides average journey time, speed and traffic flow information for 15-minute periods since April 2009 on all motorways and ‘A’ roads managed by the Highways Agency (known as the Strategic Road Network) in England. Journey times and speeds are estimated using a combination of sources, including Automatic Number Plate Recognition (ANPR) cameras, in-vehicle Global Positioning Systems (GPS) and inductive loops built into the road surface.

For example, we downloaded the CSV file ‘Feb15.csv‘, relating to February 2015 data. The header line, followed by an example data record, reads:

LinkRef, Link Description, Date, Time Period, AverageJT, Average Speed, Data Quality, Link Length, Flow
AL215, A120 between A133 and A1232 (AL215), 2015-02-10 00:00:00, 67, 305.47, 105.12, 1, 8.9200000762939453, 286.50

This line of data relates to a stretch of road north of Colchester, UK on the A120. The key information here is that on 10th February 2015, for this c.8.9km stretch of road, the average journey time (AverageJT) was 305 seconds (c.5.1 mins), at an average speed of c.105 km/h; the final Flow column (286.50) records the traffic flow. The time of day is given as 67. This number is one of 96 15-minute intervals in the day that the data refers to (0-95, where 0 indicates 00:00 to 00:15). 67 therefore corresponds to 4:45:00 PM to 5:00:00 PM (see a useful table at the end of this article for working this out).
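As a quick aid, the period-to-time conversion (and a sanity check of the journey time against the speed and length columns) can be scripted. Below is a minimal Python sketch, using only the values shown in the example row above:

from datetime import timedelta

def period_to_range(period):
    """Convert a Highways England 15-minute period number (0-95) to a (start, end) time-of-day pair."""
    start = timedelta(minutes=15 * period)
    return start, start + timedelta(minutes=15)

start, end = period_to_range(67)
print(start, "-", end)  # 16:45:00 - 17:00:00, i.e. 4:45 PM to 5:00 PM

# Sanity check on the example row: speed (km/h) = link length (km) / journey time (s) * 3600
print(round(8.92 / 305.47 * 3600, 2))  # ~105.12 km/h, matching the Average Speed column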

Google Traffic Data

Another useful source of data is from Google. The Google routing and traffic functions can be used by making a call to the Google ‘Distance Matrix’ API, described here:

https://developers.google.com/maps/documentation/distance-matrix/intro#traffic-model

Using the excellent ‘Postman‘ tool, we can formulate and test a REST call to the Google Distance Matrix API.

https://maps.googleapis.com/maps/api/distancematrix/json?units=metric&origins=enc:{s{{H{ovD:&destinations=enc:g{t{HqtiE:&departure_time=now&traffic_model=best_guess&key=<API KEY>

Parameters for this API are as follows:
units = metric values (e.g. km)
origins = starting point (encoded)
destinations = finish point (encoded)
departure_time = cannot be in the past; the keyword ‘now’ can be used
traffic_model = best_guess (rather than optimistic/pessimistic)
key = your personal API key

The parameters origins and destinations hold locations as latitude and longitude. As an alternative to decimal degree values, encoded polyline values can be used in the URL. To encode locations, the polyline utility can be used: see https://developers.google.com/maps/documentation/utilities/polylineutility

The resultant response to this REST call, made as a GET request from Postman, is:

{
    "destination_addresses": [
        "A120, Colchester CO7, UK"
    ],
    "origin_addresses": [
        "A120, Ardleigh, Colchester CO7, UK"
    ],
    "rows": [
        {
            "elements": [
                {
                    "distance": {
                        "text": "8.0 km",
                        "value": 7993
                    },
                    "duration": {
                        "text": "5 mins",
                        "value": 278
                    },
                    "duration_in_traffic": {
                        "text": "5 mins",
                        "value": 301
                    },
                    "status": "OK"
                }
            ]
        }
    ],
    "status": "OK"
}

The key information here is that at the time of making the call (‘now’), this c.8km stretch of road took between 278 and 301 seconds (c.4.6 to 5.0 mins) to drive – the difference between the ‘duration’ and ‘duration_in_traffic’ values. Google notes that this allows you to ‘receive a route and trip duration (response field: duration_in_traffic) that take traffic conditions into account’. Note also that ‘the departure_time must be set to the current time or some time in the future. It cannot be in the past’.

So in this way the Google approach allows the delay in drive time caused by traffic conditions to be quantified. Although this cannot be determined retrospectively, a speculative future departure time can be selected, for which a prediction is made based on previous traffic conditions.
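For reference, the same request can also be scripted rather than issued through Postman. The sketch below is a minimal Python example using the ‘requests’ library; the origin and destination coordinates are illustrative placeholders (plain ‘lat,lng’ strings are accepted as an alternative to encoded polylines), and you would substitute your own API key:

import requests

params = {
    "units": "metric",
    "origins": "51.978,0.988",        # illustrative lat,lng near the A120 - replace as required
    "destinations": "51.985,1.065",   # illustrative lat,lng - replace as required
    "departure_time": "now",          # must be 'now' or a future time
    "traffic_model": "best_guess",
    "key": "<API KEY>",
}
resp = requests.get("https://maps.googleapis.com/maps/api/distancematrix/json", params=params)
element = resp.json()["rows"][0]["elements"][0]
print(element["distance"]["text"])              # e.g. "8.0 km"
print(element["duration"]["value"])             # typical drive time, seconds
print(element["duration_in_traffic"]["value"])  # drive time with current traffic, seconds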


Utilities

The table used to calculate the time period for the Highways England data, described above:

Period From To
0 12:00:00 AM 12:15:00 AM
1 12:15:00 AM 12:30:00 AM
2 12:30:00 AM 12:45:00 AM
3 12:45:00 AM 1:00:00 AM
4 1:00:00 AM 1:15:00 AM
5 1:15:00 AM 1:30:00 AM
6 1:30:00 AM 1:45:00 AM
7 1:45:00 AM 2:00:00 AM
8 2:00:00 AM 2:15:00 AM
9 2:15:00 AM 2:30:00 AM
10 2:30:00 AM 2:45:00 AM
11 2:45:00 AM 3:00:00 AM
12 3:00:00 AM 3:15:00 AM
13 3:15:00 AM 3:30:00 AM
14 3:30:00 AM 3:45:00 AM
15 3:45:00 AM 4:00:00 AM
16 4:00:00 AM 4:15:00 AM
17 4:15:00 AM 4:30:00 AM
18 4:30:00 AM 4:45:00 AM
19 4:45:00 AM 5:00:00 AM
20 5:00:00 AM 5:15:00 AM
21 5:15:00 AM 5:30:00 AM
22 5:30:00 AM 5:45:00 AM
23 5:45:00 AM 6:00:00 AM
24 6:00:00 AM 6:15:00 AM
25 6:15:00 AM 6:30:00 AM
26 6:30:00 AM 6:45:00 AM
27 6:45:00 AM 7:00:00 AM
28 7:00:00 AM 7:15:00 AM
29 7:15:00 AM 7:30:00 AM
30 7:30:00 AM 7:45:00 AM
31 7:45:00 AM 8:00:00 AM
32 8:00:00 AM 8:15:00 AM
33 8:15:00 AM 8:30:00 AM
34 8:30:00 AM 8:45:00 AM
35 8:45:00 AM 9:00:00 AM
36 9:00:00 AM 9:15:00 AM
37 9:15:00 AM 9:30:00 AM
38 9:30:00 AM 9:45:00 AM
39 9:45:00 AM 10:00:00 AM
40 10:00:00 AM 10:15:00 AM
41 10:15:00 AM 10:30:00 AM
42 10:30:00 AM 10:45:00 AM
43 10:45:00 AM 11:00:00 AM
44 11:00:00 AM 11:15:00 AM
45 11:15:00 AM 11:30:00 AM
46 11:30:00 AM 11:45:00 AM
47 11:45:00 AM 12:00:00 PM
48 12:00:00 PM 12:15:00 PM
49 12:15:00 PM 12:30:00 PM
50 12:30:00 PM 12:45:00 PM
51 12:45:00 PM 1:00:00 PM
52 1:00:00 PM 1:15:00 PM
53 1:15:00 PM 1:30:00 PM
54 1:30:00 PM 1:45:00 PM
55 1:45:00 PM 2:00:00 PM
56 2:00:00 PM 2:15:00 PM
57 2:15:00 PM 2:30:00 PM
58 2:30:00 PM 2:45:00 PM
59 2:45:00 PM 3:00:00 PM
60 3:00:00 PM 3:15:00 PM
61 3:15:00 PM 3:30:00 PM
62 3:30:00 PM 3:45:00 PM
63 3:45:00 PM 4:00:00 PM
64 4:00:00 PM 4:15:00 PM
65 4:15:00 PM 4:30:00 PM
66 4:30:00 PM 4:45:00 PM
67 4:45:00 PM 5:00:00 PM
68 5:00:00 PM 5:15:00 PM
69 5:15:00 PM 5:30:00 PM
70 5:30:00 PM 5:45:00 PM
71 5:45:00 PM 6:00:00 PM
72 6:00:00 PM 6:15:00 PM
73 6:15:00 PM 6:30:00 PM
74 6:30:00 PM 6:45:00 PM
75 6:45:00 PM 7:00:00 PM
76 7:00:00 PM 7:15:00 PM
77 7:15:00 PM 7:30:00 PM
78 7:30:00 PM 7:45:00 PM
79 7:45:00 PM 8:00:00 PM
80 8:00:00 PM 8:15:00 PM
81 8:15:00 PM 8:30:00 PM
82 8:30:00 PM 8:45:00 PM
83 8:45:00 PM 9:00:00 PM
84 9:00:00 PM 9:15:00 PM
85 9:15:00 PM 9:30:00 PM
86 9:30:00 PM 9:45:00 PM
87 9:45:00 PM 10:00:00 PM
88 10:00:00 PM 10:15:00 PM
89 10:15:00 PM 10:30:00 PM
90 10:30:00 PM 10:45:00 PM
91 10:45:00 PM 11:00:00 PM
92 11:00:00 PM 11:15:00 PM
93 11:15:00 PM 11:30:00 PM
94 11:30:00 PM 11:45:00 PM
95 11:45:00 PM 12:00:00 AM

Apache Spark, Zeppelin and geospatial big data processing

There is much interest here at Cranfield University in the use of Big Data tools, and with our parallel interests in all things geospatial, the question arises – how can Big Data tools process geospatial data?

In this blog, we investigate the use of Apache Spark, Apache Zeppelin and a couple of geospatial libraries. In an earlier blog, we set up Spark and Zeppelin, and now we extend this to use these additional tools. Note that this exercise is undertaken with a MacBook, although the instructions should work with Linux just as well.

There are only a few geospatial libraries for Big Data processing that work with Spark/Hadoop. Those that do exist include the Hadoop offering from ESRI, Magellan, and GeoSpark.

GeoSpark

To set up GeoSpark, we downloaded the library ‘geospark-0.3.2-spark-2.x.jar’ from https://github.com/DataSystemsLab/GeoSpark/releases and saved the file off locally, e.g. to

/Users/sparkuser/spark/jars/

Next, in the Apache Spark installation ‘conf’ folder, we copied the template file ‘spark-defaults.conf.template’ to ‘spark-defaults.conf’ ready for editing – we need to tell Spark to use the GeoSpark jar library.

Now, we edited the new configuration file, adding a line at the end to reference the jar, e.g.

spark.jars /Users/sparkuser/spark/jars/geospark-0.3.2-spark-2.x.jar

Sourcing data

We need some spatial data for our test. We downloaded the sample data files ‘zcta510-small.csv‘ and ‘arealm-small.csv‘ (available from the GeoSpark repository linked above) to a local data location, e.g. /Users/sparkuser/spark/data/geospark.

The datasets take the following form:
arealm-small.csv

-88.331492,32.324142
-88.175933,32.360763
-88.388954,32.357073
-88.221102,32.35078
-88.323995,32.950671
...

zcta510-small.csv

-155.940114,19.081331,-155.618917,19.5307
-155.335476,19.802474,-155.104434,19.93224
-155.85966,20.120695,-155.765027,20.268469
-155.396864,19.519641,-154.987674,19.800274
-155.98572,19.53958,-155.822977,19.70849
...

The code

We now followed exactly the GeoSpark example tutorial code, in the Scala language.
First, we need to ensure the correct libraries are loaded and available:

import org.datasyslab.geospark.spatialOperator.RangeQuery
import org.datasyslab.geospark.spatialRDD.PointRDD
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.RectangleRDD
import com.vividsolutions.jts.geom.Envelope
import org.datasyslab.geospark.spatialOperator.KNNQuery
import org.datasyslab.geospark.spatialRDD.PointRDD
import com.vividsolutions.jts.geom.Coordinate
import com.vividsolutions.jts.geom.GeometryFactory
import com.vividsolutions.jts.geom.Point

Now we can run the following code and observe the following:

// Start an example Spatial Range Query without Index
val queryEnvelope=new Envelope (-113.79,-109.73,32.99,35.08);
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at Column 0 */
val resultSize = RangeQuery.SpatialRangeQuery(objectRDD, queryEnvelope, 0).getRawPointRDD().count(); /* The 0 means consider a point only if it is fully covered by the query window when doing the query */


queryEnvelope: com.vividsolutions.jts.geom.Envelope = Env[-113.79 : -109.73, 32.99 : 35.08]
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@52b8d9a6
resultSize: Long = 445

// Start an example Spatial Range Query with Index
val queryEnvelope=new Envelope (-113.79,-109.73,32.99,35.08);
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at Column 0 */
objectRDD.buildIndex("rtree"); /* Build R-Tree index */
val resultSize = RangeQuery.SpatialRangeQueryUsingIndex(objectRDD, queryEnvelope,0).getRawPointRDD().count(); /* The 0 means consider a point only if it is fully covered by the query window when doing the query */

queryEnvelope: com.vividsolutions.jts.geom.Envelope = Env[-113.79 : -109.73, 32.99 : 35.08]
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@2c3e8ebf
resultSize: Long = 445

// Start an example Spatial KNN Query without Index
val fact=new GeometryFactory();
val queryPoint=fact.createPoint(new Coordinate(-109.73, 35.08));
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at Column 0 */
val resultSize = KNNQuery.SpatialKnnQuery(objectRDD, queryPoint, 5); /* The number 5 means 5 nearest neighbors */

fact: com.vividsolutions.jts.geom.GeometryFactory = com.vividsolutions.jts.geom.GeometryFactory@35f6b599
queryPoint: com.vividsolutions.jts.geom.Point = POINT (-109.73 35.08)
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@76d6439b
resultSize: java.util.List[com.vividsolutions.jts.geom.Point] = [POINT (-109.538914 35.123446), POINT (-108.729849 37.196678), POINT (-117.105253 33.48551), POINT (-120.679839 35.25764), POINT (-120.860368 35.398047)]

// Start an example Spatial KNN Query with Index
val fact=new GeometryFactory();
val queryPoint=fact.createPoint(new Coordinate(-109.73, 35.08));
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at Column 0 */
objectRDD.buildIndex("rtree"); /* Build R-Tree index */
val resultSize = KNNQuery.SpatialKnnQueryUsingIndex(objectRDD, queryPoint, 5); /* The number 5 means 5 nearest neighbors */

fact: com.vividsolutions.jts.geom.GeometryFactory = com.vividsolutions.jts.geom.GeometryFactory@24046396
queryPoint: com.vividsolutions.jts.geom.Point = POINT (-109.73 35.08)
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@6db7719d
resultSize: java.util.List[com.vividsolutions.jts.geom.Point] = [POINT (-109.538914 35.123446), POINT (-108.729849 37.196678), POINT (-108.135158 37.242491), POINT (-107.596572 37.000003), POINT (-107.79524 37.225479)]

// Start an example Spatial Join Query without Index
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv", "rtree", 4); /* The 0 means the spatial attribute starts at Column 0, the number 4 means 4 RDD partitions, "rtree" means use an R-Tree Spatial Partitioning Grid */
val rectangleRDD = new RectangleRDD(sc, "/Users/sparkuser/spark/data/geospark/zcta510-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at Column 0 */
val joinQuery = new JoinQuery(sc,objectRDD,rectangleRDD);
val resultSize = joinQuery.SpatialJoinQuery(objectRDD,rectangleRDD).count();
objectRDD.totalNumberOfRecords  /* see https://github.com/DataSystemsLab/GeoSpark/blob/master/src/main/java/org/datasyslab/geospark/spatialRDD/PointRDD.java for API */

objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@730e3723
rectangleRDD: org.datasyslab.geospark.spatialRDD.RectangleRDD = org.datasyslab.geospark.spatialRDD.RectangleRDD@2bf31c8c
joinQuery: org.datasyslab.geospark.spatialOperator.JoinQuery = org.datasyslab.geospark.spatialOperator.JoinQuery@36cecee7
resultSize: Long = 9989

// Start an example Spatial Join Query with Index
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv", "rtree", 4); /* The 0 means the spatial attribute starts at Column 0, the number 4 means 4 RDD partitions, "rtree" means use an R-Tree Spatial Partitioning Grid */
val rectangleRDD = new RectangleRDD(sc, "/Users/sparkuser/spark/data/geospark/zcta510-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at Column 0 */
val joinQuery = new JoinQuery(sc,objectRDD,rectangleRDD);
objectRDD.buildIndex("rtree"); /* Build R-Tree index */
val resultSize = joinQuery.SpatialJoinQueryUsingIndex(objectRDD,rectangleRDD).count();

objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@1301fbdd
rectangleRDD: org.datasyslab.geospark.spatialRDD.RectangleRDD = org.datasyslab.geospark.spatialRDD.RectangleRDD@ebfb5e7
joinQuery: org.datasyslab.geospark.spatialOperator.JoinQuery = org.datasyslab.geospark.spatialOperator.JoinQuery@197ff4a6
resultSize: Long = 9989

Apache Spark and Zeppelin – Big Data Tools

Cranfield University students and staff recently joined other members of the DREAM Centre for Doctoral Training in Big Data, on the excellent ‘Winter School’ in Big Data at the Hartree Centre, the UK’s pre-eminent centre for Big Data technology. We were able to explore the impressive capability of the Apache Spark environment on the Hartree’s IBM compute cluster.

Learning Apache Spark™ offers a useful insight into Big Data processing, and the opportunities available in handling data at scale. Spark is a fast and general engine for large-scale data processing, and has emerged as the software ‘ecosystem’ of choice for contemporary Big Data processing. Its huge advantage over earlier Big Data approaches is that it keeps its operations and intermediate results in memory, avoiding the cost of successive disk operations; as a consequence it is very quick. Spark has four key modules that allow powerful, complementary data processing: ‘SQL and DataFrames’, ‘Spark Streaming’, ‘MLlib’ (machine learning) and ‘GraphX’ (graph processing).

The good news is that one can learn Spark in a number of ways, all at no cost. Most of the big cloud providers who offer Spark provide ‘community accounts’, where one can register a free account in order to learn (e.g. IBM Data Science Experience, Databricks and MS Azure, to name a few). However, Spark can also be installed locally on a laptop which, if it has a multi-core processor, can do some parallel processing of a sort: certainly enough for our learning purposes. It is therefore the installation of a local Big Data Spark environment on a MacBook laptop that forms the basis for this post (this will all also work on Linux).

In addition to Spark, this post also allows us to explore the use of the Apache Zeppelin™ notebook environment. Notebooks are a fantastic way to keep a record of projects, with processing code and contextual information all kept in one document. For this whole project exercise then we undertook the following steps:

Load up some sample CSV data

As a very first step, we wanted to download some sample data onto the local disk that could be representative of ‘Big Data’. The CSV format (Comma Separated Values) is widely used as a means of working with large datasets, so we will use this. The Apache Foundation themselves have a number of example files – so we will use one of them – ‘bank.csv’. To pull a file in locally, use the ‘curl‘ command, thus:

curl "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv" -o "bank.csv"

On other systems, the ‘wget‘ command can also be used (e.g. on linux). After this we have a file ready for later use.
CSV, Comma Separated Values file
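If you prefer to stay within Python throughout, the same download can be scripted with the standard library (same URL as in the curl command above):

import urllib.request

# Fetch the sample bank.csv file into the current working directory
url = "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"
urllib.request.urlretrieve(url, "bank.csv")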

Installing Spark

Next, we need to install Spark itself. The steps are as follows:
1. Go to http://spark.apache.org and select ‘Download Spark’.
2. We left the version ‘drop down’ at the latest (default): for us this was v2.0.2
3. We downloaded the resultant file ‘spark-2.0.2-bin-hadoop2.7.tgz’.
4. We created a new folder ‘spark’ in our user home directory, and opening a terminal window, we unpacked the file thus:
tar -xvf spark-2.0.2-bin-hadoop2.7.tgz.
5. After this we checked the files are all present in /Users/geothread/spark/spark-2.0.2-bin-hadoop2.7.
The next step is to check the configuration. In the terminal, move to the Spark conf folder:
cd /Users/geothread/spark/spark-2.0.2-bin-hadoop2.7/conf.
6. Templates. Note that in the conf folder there are a number of files ending *.template (e.g. ‘spark-defaults.conf.template’). These template files are provided for you to edit as required. If you need to do this, you copy the template file, removing the suffix, and then edit as required (e.g. cp spark-defaults.conf.template spark-defaults.conf). In fact, we will leave these default settings as they are for now in our local installation.
7. Running Spark. To run Spark, in the terminal, move to the bin folder. We will start off by running Scala. Scala is the programming language that Spark is mostly written in, and it can also be run at the command line. In running Scala, note how the Spark context ‘sc’ is made available for use (the Spark context is the ‘instance’ of Spark that is running):

bin$> ls
beeline pyspark2.cmd spark-shell2.cmd
beeline.cmd run-example spark-sql
derby.log run-example.cmd spark-submit
load-spark-env.cmd spark-class spark-submit.cmd
load-spark-env.sh spark-class.cmd spark-submit2.cmd
metastore_db spark-class2.cmd sparkR
pyspark spark-shell sparkR.cmd
pyspark.cmd spark-shell.cmd sparkR2.cmd

bin$> ./spark-shell
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.2
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@8077c97
scala> System.exit(1)

Now instead we will switch to Python. We will try running PySpark, the API that exposes Spark to Python, and this time we can also load and do a line count of the sample CSV data downloaded earlier. Note that the Spark context sc is again made available:

bin$> ./pyspark
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.2
/_/
Using Python version 2.7.10 (default, Jul 30 2016 18:31:42)
SparkSession available as 'spark'.
>>> sc
<pyspark.context.SparkContext object at 0x10b021f10>
>>> df = spark.read.csv("/Users/geothread/bank.csv", header=True, mode="DROPMALFORMED")
>>> df.count()
4521
>>> exit()

Monitoring jobs

We can also go and check up on the Spark jobs we ran, by accessing the web dashboard built into Spark. It runs by default on port 4040, so this number must be added to the URL after the colon, thus:

http://localhost:4040/jobs/
Spark dashboard
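Incidentally, the dashboard’s information is also exposed programmatically through Spark’s monitoring REST API, served on the same port under /api/v1. A minimal Python sketch (assuming the default port 4040 and the ‘requests’ library):

import requests

# List the applications known to the local Spark UI, then their jobs
base = "http://localhost:4040/api/v1"
for app in requests.get(base + "/applications").json():
    print(app["id"], app["name"])
    for job in requests.get(base + "/applications/" + app["id"] + "/jobs").json():
        print("  job", job["jobId"], job["status"])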
Hopefully this all works OK and the dashboard can be accessed. The next step is to install and configure the Apache Zeppelin notebook.

Installing Zeppelin

Apache Zeppelin offers a web-based notebook enabling interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. Zeppelin installs and runs as a server – so there are some fiddly bits to getting it going. Under the bonnet, notebooks are saved off in JSON format – but we don’t really need to know this just to use it.
Apache Zeppelin
To obtain and run the Zeppelin notebook, use following steps:
1. Go to https://zeppelin.apache.org and ‘Get Download’.
2. Save off and unpack the file to a new folder created in your home folder, e.g. ‘/Users/geothread/zeppelin’.
tar -xvf zeppelin-0.6.2-bin-all.tar
3. Go to the conf folder
cd conf
As before, note the template files, look at the file ‘zeppelin-site.xml.template’ – Zeppelin will run on port 8080 by default. If you need to change this (and we did – we needed it to use port 9080 instead), you can make a copy of this file.
cp zeppelin-site.xml.template zeppelin-site.xml
4. Edit the new file with your favourite text editor, (e.g. with nano), to change the port as required.
5. Also in this file, if you are running the zeppelin server locally then you can also edit the server IP to ‘localhost’. When we’d finished editing for our server, the file was as follows (in summary the two edits were to add ‘localhost’ and ‘9080’):

<property>
<name>zeppelin.server.addr</name>
<value>localhost</value>
<description>Server address</description>
</property>

<property>
<name>zeppelin.server.port</name>
<value>9080</value>
<description>Server port.</description>
</property>

6. At this point we ensured the following lines were in the account .profile configuration file in the home folder. Note that, as an alternative, these settings can also be added to the configuration files in the Zeppelin conf folder.

export JAVA_HOME=$(/usr/libexec/java_home)
export SPARK_HOME="$HOME/spark/spark-2.0.2-bin-hadoop2.7"

7. The next step may or may not be necessary – it was for us. We knew this was necessary because we got an error message (relating to the Jackson libraries) just like one described online:
8. Go to https://mvnrepository.com/artifact/com.fasterxml.jackson and download the files: ‘jackson-core-2.6.5.jar‘, ‘jackson-annotations-2.6.5.jar‘ and ‘jackson-databind-2.6.5.jar‘. Note these are not the latest files available! The latest jackson file version didn’t work for us – but v2.6.5 worked fine.
9. Go to the lib folder and remove the files ‘jackson-core-2.5.0.jar‘, ‘jackson-annotations-2.5.0.jar‘ and ‘jackson-databind-2.5.0.jar‘ (best to just move them somewhere else, e.g. the downloads folder).
10. Copy the three new downloaded v2.6.5 version files into the lib folder.
11. Now go to the bin folder and start the server (before using Zeppelin at any time, you will need to ensure the server is running):
./zeppelin-daemon.sh start
12. Note you can also stop and restart the daemon at any time, like this:
./zeppelin-daemon.sh restart and ./zeppelin-daemon.sh stop
You may find it useful to add some shortcuts to your .profile file to save time, for example (with each command being all on one line, and using the correct path of course):
alias zeppelin_start='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh start'
alias zeppelin_stop='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh stop'
alias zeppelin_restart='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh restart'

13. Next, open a browser and open the zeppelin notebook home page:
http://localhost:9080 (or whatever the port number is for you), and hopefully you are off, up and running.
Apache Zeppelin
14. Try running the sample notebooks provided. There are many online tutorials for Spark available – such as the excellent one here. So there is no need to reinvent the wheel repeating all that here on GeoThread. However, there are not so many tutorials showing how to integrate geospatial data into Big Data operations, an interest for us – so we hope a future blog will look at that.

If you want to know more about Zeppelin and get a walk through of its many features, watch this video by Moon soo Lee, the understated genius who created Zeppelin, speaking in Amsterdam at Spark Summit Europe 2015.

ESRI Insights for ArcGIS – 1

ArcGIS Enterprise: Installing ArcGIS Server 10.5 Beta

Introduction

Keenly anticipated here at Cranfield University is the newly launched ESRI Insights for ArcGIS app, part of the new ArcGIS Enterprise suite, which, amongst other things, can be deployed to explore the use of Hadoop/HDFS technologies with geospatial data – offering powerful spatial analytics capabilities for this data.

So what is Insights for ArcGIS? Well ESRI have advised us that “Insights for ArcGIS is an app that you access through ArcGIS Enterprise that allows you to perform exploratory and iterative data analysis. With a minimal drag-and-drop interface, you can answer questions with data from ArcGIS services, Excel spreadsheets, and data warehouses.” This sounds great – we are also interested in its stated ability to handle Big Data databases, offering for all these sources easy access to the most widely used GIS analysis tools. Insights for ArcGIS is designed to enable easy analysis of data, revealing inherent patterns to gain situational awareness, as well as providing tools to explore what-if scenarios, presented in the form of connected charts, graphs, and maps. Cranfield are grateful to ESRI (UK) for the opportunity to act as a Beta tester for this new strategic tool from ESRI, drawing on established linkages via the DREAM Centre for Doctoral Training (CDT) in Big Data and Environmental Risk Mitigation.

In a series of blog pages which we will place here on Geothread, we will document the process of installing and testing this software, adding some helpful commentary on the way that should hopefully help others tread the same path!

ESRI Early Adopter Programme

The first act in this story is joining the ESRI Early Adopter programme (https://earlyadopter.esri.com). This provides first hand access to emergent software in Beta form.

Logging into the website, all the beta edition materials are made available for Insights for ArcGIS. A key early document to look at is the Insights for ArcGIS User Guide (insights_user_guideEAP2_1-1.pdf), which outlines the stages required for installation.

Insights for ArcGIS is part of the new ArcGIS Enterprise family from ESRI. We are informed by ESRI that ArcGIS Enterprise, “the next evolution of the ArcGIS Server product line, is a mapping and analytics platform that runs on your private infrastructure. It has a flexible deployment model allowing for use completely on-premises – connected or disconnected from the open internet – on physical hardware or virtualized environments, in the cloud on Amazon Web Services (AWS) or Microsoft Azure, or any other environment that meets the basic system requirements. This flexibility also allows you to add a variety of capabilities and distribute your deployment across infrastructure that supports your business needs.” Sounds good!

The ArcGIS Enterprise product includes the following software components:

  • ArcGIS Server
  • Portal for ArcGIS
  • ArcGIS Data Store
  • ArcGIS Web Adaptor

To get Insights for ArcGIS to work, we need to install these pre-requisites, which we will be installing step by step. So we will post here a blog of our progress in installing all these bits, as follows:

  • ArcGIS Server 10.5 Beta – we need 10.5 for this to work
  • Portal for ArcGIS
  • ArcGIS Web Adaptor
  • ArcGIS Data Store
  • An instance of MS SQLServer Database
  • A JDBC 4.0 Compliant driver

First things first – we need to install the Beta edition of ArcGIS 10.5 Server.

ArcGIS Server 10.5 Beta

To get Insights for ArcGIS to work, we need to get ArcGIS Server 10.5 up and running. We fired up our trusty Linux server for this task. This server was already running an earlier 10.2 version of ArcGIS Server. The instructions make it clear one can upgrade – but we chose to go for a clean install by preference, so uninstalled ArcGIS Server 10.2 as a first act.

We now copied over the file from the early adopter site ‘ArcGIS_for_Server_Linux_105_beta1.tar.gz‘, which contains the installation files for the new installation. This tar/gz file contains a folder called ‘Documentation’ – within which is a web/html set of instructions. We found it useful to extract these files off to a separate computer for consultation as the process unfolded.

Next, we downloaded to our installation folder (our home directory on the test server) the sample provisioning authorisation file ‘ArcGISforServerAdvancedEnterprise_Server_105alpha.prvc‘. We edited the header for this file with our details, but apart from that left the actual codes alone. The key learning point here, was that one has to add a valid email to the header of the file. More detailed advice received from ESRI concerning this, was that:

“To authorize ArcGIS Server, use the following parameter to authorize ArcGIS Server using the provisioning file:
authorizeSoftware -f ArcGISforServerAdvancedEnterprise_Server_105alpha.prvc -e
since the provisioning file does not include an email address.”
So that is an alternative approach – editing in the email explicitly to the prvc file worked OK for us.

The next step was to unpack the installation media, thus:
~> gunzip ArcGIS_for_Server_Linux_105_beta1.tar.gz
~> tar -xvf ArcGIS_for_Server_Linux_105_beta1.tar

This creates a folder ‘ArcGISServer’ with all the installation media in it ready to go. From here, we ran the setup programme, thus:
~> cd ArcGISServer
~> ./Setup -l yes

Running ‘./Setup -help’ shows all the options available. The setup starts off a console process of installing ArcGIS Server. You go through various pages about the installation destination (we accepted the defaults for all these choices), and you view the terms and conditions of the Beta software.

At the end of this process, the installation asks for the full path to the provisioning licence authorisation file. This was provided and the installation ran on. At the end of this, the prompt says to press enter to exit – clearly this act starts the server up as, until you get the prompt back, the server will not run.

The installation is placed by default in a folder ‘/home/arcgis/server’, (home as in the home folder of the installing user). Operationally, we might put the files somewhere more conventional (e.g. /opt), but this is fine for our testing purposes. Within the folder is a ‘tools’ folder with some useful utilities. The following were useful:

~> cd ~/arcgis/server/tools
~/arcgis/server/tools> ./serverinfo
Server: 10.50.0.6318
JRE: 1.8.0_65
Tomcat: 7.0.64.0
Geronimo: 2.2.2
Wine: wine-1.8-rc3-985-g4f7221d

~/arcgis/server/tools> ./patchnotification/patchnotification
===============================================================================
ArcGIS for Server and Extensions Patch Notification
===============================================================================
Installed components:
Component Version
ArcGIS Server 10.5
===============================================================================
Available Updates:
ArcGIS Server
(no updates available)
===============================================================================
Installed Patches:
(none)
===============================================================================
To browse a full list of Esri patches and service packs, visit the Esri Support site:
http://support.esri.com/en/downloads/patches-servicepacks/

~/arcgis/server/tools> ./authorizeSoftware -s
[shows all software licenced]

Note authorizeSoftware -f would allow later application of a provisioning file to an installation.

So far so good, software installed and licenced. Now time to fire up the interface:
We entered the URL of server for the first time:

http://SERVER.NAME.HERE:6080/arcgis/manager

All worked well. This was the first time Server was started, so we were asked if we wanted to create a New site, or join another – we selected create a new site:
ArcGIS Server

Next, we are asked to create a username and password for the Server manager:
ArcGIS Server

Next, we specify the root server folder and the configuration store location:
ArcGIS Server
A summary is shown:
ArcGIS Server
And now we waited whilst the installation finished (actually quite a long wait – but we are patient!)
ArcGIS Server
Finally it completes and we can log in for the first time:
ArcGIS Server

There then followed problems for us! It had seemed like a ‘good idea’ to stop and restart the server (using the utilities in the folder ~/arcgis/server) – however, it was not! We tried to log back into the Server manager URL, thus:

http://SERVER.NAME.HERE:6080/arcgis/manager

Port 6080 is the http communication port. ArcGIS Server immediately tries to switch to HTTPS on port 6443, e.g.:

https://SERVER.NAME.HERE:6443/arcgis/manager

This didn’t work and we lost access to the server. It wasn’t obvious what had happened here at first, as our old version of ArcGIS Server 10.2 had worked fine. After investigation, it transpired the new https port of 6443 was not the same as the https port used in the earlier installation. The firewall was blocking the new port – quickly remedied. However, even then we still had a problem connecting. The URL was trying to connect to an apparently absent server. Fortunately, Server has some diagnostic tools.

~/arcgis/server/tools/> ./serverdiag/serverdiag
========================================================================
ArcGIS Server 10.5 Diagnostic Tool
Hostname: SERVER NAME
========================================================================
DIAG000: Check for installation as root [PASSED]
DIAG001: Check for 64-bit architecture [PASSED]
DIAG002: Check OS version [PASSED]
DIAG003: Check hostname for invalid characters [PASSED]
DIAG024: Check /etc/hosts for hostname entry [PASSED]
DIAG004: Check installed packages [PASSED]
DIAG005: Check system limits [PASSED]
DIAG008: Check HTTP port [PASSED]
DIAG009: Check HTTPS port [PASSED]
DIAG010: Check Xvfb ports [PASSED]
DIAG020: Check hostname IP address mismatches [WARNING]
DIAG026: Check processes for ArcGIS core services [PASSED]
------------------------------------------------------------------------
There were 0 failure(s) and 1 warning(s) found:

ESRI have a good page explaining the diagnostic warnings here:
http://server.arcgis.com/en/server/latest/administer/linux/checking-server-diagnostics-using-the-diagnostics-tool.htm

We quickly realised there were inconsistencies in the server hosts file (nothing to do with ESRI), which were soon remedied. Finally the system started up and worked. We logged onto the Manager page:

ArcGIS Server

We noted the ‘Certificate error’ – the software actually provides its own default certificate just to get it all going. We can fix that later (there is lots of help online to do this), and this is a test installation in any case:
ArcGIS Server

And now finally we see the main manager screen again:
ArcGIS Server

And then for the first time we can see data being served out of the ArcGIS Server 10.5 software – fantastic!
ArcGIS Server

In the next blog, we will start installing the other pre-requisite software tools for ESRI Insights for ArcGIS, starting with Portal for ArcGIS.

Thanks for reading!

The Internet of Things with Photon – Temperature and Humidity logging

Happy New Year from Geothread! Much is written about the Internet of Things, so here at Cranfield University, as a post-Christmas project, we wanted to explore some of the possibilities for interconnected devices, sensors and data streams. To do this we are using the fantastic ‘Photon’ microprocessor controller (formerly called the Spark) from Particle (https://www.particle.io).

The inexpensive Photon device (https://www.particle.io/prototype) provides a microprocessor board with an array of digital and analogue pins for connecting up your sensors and actuators, and a USB socket for providing power (and local data services). The Photon’s real strength lies in its onboard Broadcom WiFi chip. Whereas an Arduino or similar board is effectively self-contained and fiddly to connect to the rest of the world, the Photon board allows you to connect directly and immediately to the Particle cloud (a web service provided by Particle) to which all the data streams can be sent. It is therefore straightforward to develop a simple data logging application, streaming data onto the cloud for further processing and analysis. The Photon is also broadly code-compatible with the Arduino – so code can be transposed across easily.

If you are not on WiFi, Particle also offers the ‘Electron’ device, which offers the same capabilities, but takes a mobile phone SIM card instead of WiFi, allowing for remote access. Both the Photon and the Electron are really designed for prototyping up ideas; once you have a working design, you can use Particle’s PØ and P1 devices for mass production! Shown below is the Photon mounted onto a breadboard.

Photon on breadboard

The project at hand is to develop a simple data logger for temperature and humidity, using the trusty DHT11 sensor. In that sense, this project is similar to our earlier Bluetooth data logger – but now the data will go to the Internet via its WiFi connection (it can store up to 5 connections).

The steps required (broadly following the excellent Particle startup guide) are:

  1. Create an account on the Particle website portal – https://build.particle.io/login
  2. Download to your phone (e.g. iPhone/Android) the Particle ‘App’ and log in
  3. Power up the Photon (we used a standard USB micro B cable from a phone charger)
    1. We next need to get the Photon to connect to the local WiFi. Press and hold (carefully!) the Photon setup button to enter its setup mode
    2. Use the phone’s WiFi to connect to the WiFi network broadcast by the Photon – the SSID is something like ‘Photon-XXX’, where ‘XXX’ is the unique number of the device. Note, we had terrible trouble initially connecting the Photon to a WEP-encrypted broadband router. Turning off all router security worked fine – but this is no long-term solution. Enabling WPA router security was all that was needed to ensure easy connection to the Photon (conclusion – use WPA, not WEP, security!!). The App guides you through introducing the Photon onto the network, and adding the WPA WiFi security phrase. Once the Photon is finally online, it can take a few minutes (6-12) to update its firmware – leave it alone to do this! You also get a chance to give your device a name – useful if you intend to have several devices.
  4. Next we need to wire up the Photon. A breadboard is a useful aid for initial prototyping.
    1. Connect pin 1 (on the left) of the sensor to +5V
    2. Connect pin 2 of the sensor to whatever your DHTPIN is
    3. Connect pin 4 (on the right) of the DHT11 sensor to GROUND
    4. Connect a 10K resistor from pin 2 (data) to pin 1 (power) of the sensor (we only had a 12k resistor handy but this was OK). Leave the device powered up ready to receive software code via the web.

Photon on breadboard with sensor attached

The next step moves from the hardware to the software. Particle offer a number of ways to control and programme the Photon. The phone App itself has a ‘tinker’ mode which allows one to turn the on-board LEDs on and off, etc. Next, there is a web-based development environment (IDE) (https://build.particle.io/build/) – a very elegant solution for programming the device. There is also a programme that can be installed, the ‘Particle Dev‘ (rather like the Arduino IDE), and finally command-line directives using Node.JS. To start with at least, it is easiest to use the web IDE interface. Also, in many ways the whole idea of the Internet of Things is to use cloud services – so data collection should also be a cloud-based activity.

Particle Web IDE development

To get us going, we selected the ‘Community Library’ called ‘ADAFRUIT_DHT’ developed by Adafruit (they produce great microprocessor kit too by the way). Their ‘dht-test.ino’ code can be adapted and edited, and the library added to the project. For editing, you will need to indicate the digital pin the DHT11 is connected to, e.g. for pin 2 ‘#define DHTPIN 2’. Also the type of sensor, e.g. for DHT11 ‘#define DHTTYPE DHT11’. One can also edit the loop delay for taking readings (e.g. for 5 seconds, ‘delay(5000);’).

In the run loop, we can also add instructions to publish the data readings to the Particle cloud. This is done by adding the lines:

Particle.publish("Humidity", String(h));
Particle.publish("Temperature", String(t));
Particle.publish("Dew point", String(dp));
Particle.publish("Heat Index", String(hi));

Once ready, the code can be flashed to (written to) the Photon device, over the Internet – neat!
And that is it – the Photon should now be up and running logging temperature and humidity data etc every 5 seconds. With thanks and acknowledgements to Adafruit, the software code used is shown at the end of this article.

The next task is to recover the data arriving on the Particle cloud originating from the device. There are a number of ways to do this, but the easiest initial means is to use the Particle Dashboard (see https://dashboard.particle.io/user/logs). This allows you to connect to, receive and visualise data from your running device.

Particle Dashboard showing data streaming in

You can see the data arriving at the dashboard, each reading being timestamped.

Enhancements for this project

This project is only the start. One can capture and store the data streams arriving from the Photon in a database, and the database can then be consulted to produce time series runs of data. Multiple Photon devices can be scattered across an area, and a web map of interpolated meteorological data produced. Other sensors can be added (e.g. a GPS receiver to record location), and so on. The whole assembly can be ruggedised in a waterproof box. Really there are so many ways to develop and enhance the basic concept.
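As a pointer towards the database idea, the sketch below shows one way this might be done in Python: it subscribes to the Particle cloud’s server-sent events (SSE) stream of the device’s published events and appends each reading to a SQLite table. The endpoint and payload fields used here are our assumptions based on the Particle Cloud API – check the current documentation – and the access token placeholder must be replaced with your own:

import json
import sqlite3
import requests

ACCESS_TOKEN = "<PARTICLE ACCESS TOKEN>"  # placeholder - taken from your Particle account
# Assumed Particle Cloud SSE endpoint for your devices' events (check the API docs)
STREAM_URL = "https://api.particle.io/v1/devices/events?access_token=" + ACCESS_TOKEN

db = sqlite3.connect("photon_log.db")
db.execute("CREATE TABLE IF NOT EXISTS readings (published_at TEXT, name TEXT, value TEXT)")

event_name = None
with requests.get(STREAM_URL, stream=True) as resp:
    for raw in resp.iter_lines():
        line = raw.decode("utf-8").strip()
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()   # e.g. Temperature, Humidity
        elif line.startswith("data:") and event_name:
            payload = json.loads(line[len("data:"):])   # assumed fields: data, published_at
            db.execute("INSERT INTO readings VALUES (?, ?, ?)",
                       (payload.get("published_at"), event_name, payload.get("data")))
            db.commit()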

What comes next?

The Particle Photon (and Electron) are truly amazing devices – so powerful and so easy to connect up to the Internet. Truly these devices can contribute to the ‘Internet of Things’. To get some real inspiration as to the sorts of projects that exist for these devices, visit https://particle.hackster.io. If you want to store the data arising from the sensor, also have a look at https://data.sparkfun.com/


Here is the software code used in this prototype:

// This #include statement was automatically added by the Particle IDE.
#include "Adafruit_DHT/Adafruit_DHT.h"

// Example testing sketch for various DHT humidity/temperature sensors
// Written by ladyada, public domain

#define DHTPIN 2 // what pin we’re connected to

// Uncomment whatever type you’re using!
#define DHTTYPE DHT11 // DHT 11
//#define DHTTYPE DHT22 // DHT 22 (AM2302)
//#define DHTTYPE DHT21 // DHT 21 (AM2301)

// Connect pin 1 (on the left) of the sensor to +5V
// Connect pin 2 of the sensor to whatever your DHTPIN is
// Connect pin 4 (on the right) of the sensor to GROUND
// Connect a 10K resistor from pin 2 (data) to pin 1 (power) of the sensor

DHT dht(DHTPIN, DHTTYPE);

void setup() {
Serial.begin(9600);
Serial.println("DHT11 test!");

dht.begin();
}

void loop() {
// Wait a few seconds between measurements.
delay(2000);

// Reading temperature or humidity takes about 250 milliseconds!
// Sensor readings may also be up to 2 seconds ‘old’ (its a
// very slow sensor)
float h = dht.getHumidity();
// Read temperature as Celsius
float t = dht.getTempCelcius();
// Read temperature as Farenheit
float f = dht.getTempFarenheit();

// Check if any reads failed and exit early (to try again).
if (isnan(h) || isnan(t) || isnan(f)) {
Serial.println("Failed to read from DHT sensor!");
return;
}

// Compute heat index
// Must send in temp in Fahrenheit!
float hi = dht.getHeatIndex();
float dp = dht.getDewPoint();
float k = dht.getTempKelvin();

Serial.print("Humid: ");
Serial.print(h);
Serial.print("% - ");
Serial.print("Temp: ");
Serial.print(t);
Serial.print("*C ");
Serial.print(f);
Serial.print("*F ");
Serial.print(k);
Serial.print("*K - ");
Serial.print("DewP: ");
Serial.print(dp);
Serial.print("*C - ");
Serial.print("HeatI: ");
Serial.print(hi);
Serial.println("*C");
Serial.println(Time.timeStr());

Particle.publish("Humidity", String(h));
Particle.publish("Temperature", String(t));
Particle.publish("Dew point", String(dp));
Particle.publish("Heat Index", String(hi));
delay(5000);
}

Cranfield’s MSc Environmental Informatics renamed to MSc Environmental Data Science

Here at Cranfield University, we run a number of technically-oriented taught Masters courses. This website reflects some of the work of the staff involved in teaching and delivering on these courses.

One of our MSc courses is in ‘Environmental Informatics’.

From the next academic year in October 2015 onwards, we have decided to rename this course. So now the course is known as Cranfield University’s ‘MSc in Environmental Data Science’.

We made this change following careful consideration and advice from our industrial partners and student alumni. It was felt that the new name more closely reflected the ambitions of the course (data, modelling, visualisation and analytical techniques in the environmental sciences), and, being a well-understood term, would stand alumni in better stead in the subsequent jobs market.

The course is described at http://www.cranfield.ac.uk/courses/masters/environmental-informatics.html and in fact all other aspects of the course remain the same (only the title is changed).

Cranfield University MSc Environmental Data Science