Exploring traffic times data

A recent investigation considered the sources of road journey traffic time data, and this blog recounts some of that investigation. First, the sources of the data themselves.

Highways England Data

Thanks to the fantastic open data revolution we now have a huge wealth of public data available via the www.data.gov.uk portal. Here for example we can source data on traffic times from the Highways England agency. Their traffic times data can be obtained from https://data.gov.uk/dataset/dft-eng-srn-routes-journey-times

This data series provides average journey time, speed and traffic flow information for 15-minute periods since April 2009 on all motorways and ‘A’ roads managed by the Highways Agency (known as the Strategic Road Network) in England. Journey times and speeds are estimated using a combination of sources, including Automatic Number Plate Recognition (ANPR) cameras, in-vehicle Global Positioning Systems (GPS) and inductive loops built into the road surface.

For example, we downloaded the CSV file ‘Feb15.csv‘, relating to February 2015 data. The header and first data line read:

LinkRef Link Description Date Time Period AverageJT Average Speed Data Quality Link Length Flow
AL215 A120 between A133 and A1232 (AL215) 2015-02-10 00:00:00 67 305.47 105.12 1 8.9200000762939453 286.50

This line of data relates to a stretch of the A120 north of Colchester, UK. The key information is that on 10th February 2015, this c.9 km stretch of road took on average 305 seconds (c.5.1 minutes) to drive (the AverageJT field). The time of day is given as 67: the day is divided into 96 15-minute intervals (0-95, where 0 indicates 00:00 to 00:15), so 67 corresponds to 4:45:00 PM to 5:00:00 PM (see the useful table at the end of this article for working this out).
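Rather than looking the period up by hand, the conversion between period number and clock time is simple arithmetic. A minimal Python sketch (our own, not part of the dataset documentation) is:

# Convert between the Highways England 15-minute period number (0-95)
# and clock times.
def _fmt(minutes):
    return "{:02d}:{:02d}".format((minutes // 60) % 24, minutes % 60)

def period_to_times(period):
    """Return (start, end) clock times for a period number 0-95."""
    start = period * 15
    return _fmt(start), _fmt(start + 15)

def time_to_period(hours, minutes):
    """Return the period number for a given time of day."""
    return (hours * 60 + minutes) // 15

print(period_to_times(67))     # ('16:45', '17:00')
print(time_to_period(16, 45))  # 67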

Google Traffic Data

Another useful source of data is from Google. The Google routing and traffic functions can be used by making a call to the Google ‘Distance Matrix’ API, described here:

https://developers.google.com/maps/documentation/distance-matrix/intro#traffic-model

Using the excellent ‘Postman‘ tool, we can formulate and test a REST call to the Google Distance Matrix API.

https://maps.googleapis.com/maps/api/distancematrix/json?units=metric&origins=enc:{s{{H{ovD:&destinations=enc:g{t{HqtiE:&departure_time=now&traffic_model=best_guess&key=<API KEY>

Parameters for this API are as follows:
units = metric values (e.g. km)
origins = starting point (encoded)
destinations = finish point (encoded)
departure_time = cannot be historical; ‘now’ is a keyword
traffic_model = best_guess (rather than optimistic/pessimistic)
key = your personal API key

The parameters origins and destinations hold locations in latitude and longitude. As an alternative to decimal degree values, encoded polyline values can be used in the URL. To encode locations, the polyline utility can be used: see https://developers.google.com/maps/documentation/utilities/polylineutility
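If you prefer to generate the encoded values programmatically, one option (an assumption on our part, not part of Google’s documentation) is the third-party Python ‘polyline’ package (pip install polyline). The coordinates below are illustrative values only:

import polyline

origin = (51.88, 0.95)       # (lat, lon) - illustrative values near Colchester
destination = (51.89, 1.03)  # (lat, lon) - illustrative values

origin_enc = polyline.encode([origin])            # use in origins=enc:<value>:
destination_enc = polyline.encode([destination])  # use in destinations=enc:<value>:
print(origin_enc, destination_enc)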

The response to this REST call, sent as a GET request using Postman, is:

{
    "destination_addresses": [
        "A120, Colchester CO7, UK"
    ],
    "origin_addresses": [
        "A120, Ardleigh, Colchester CO7, UK"
    ],
    "rows": [
        {
            "elements": [
                {
                    "distance": {
                        "text": "8.0 km",
                        "value": 7993
                    },
                    "duration": {
                        "text": "5 mins",
                        "value": 278
                    },
                    "duration_in_traffic": {
                        "text": "5 mins",
                        "value": 301
                    },
                    "status": "OK"
                }
            ]
        }
    ],
    "status": "OK"
}

The key information here is that at the time of making the call (‘now’), this c.8 km stretch of road took between 278 and 301 seconds (c.4.6 to 5.0 minutes) to drive. Key to this is the difference between the ‘duration’ and ‘duration_in_traffic’ values. Google note that this allows you to ‘receive a route and trip duration (response field: duration_in_traffic) that take traffic conditions into account’. Note that ‘the departure_time must be set to the current time or some time in the future. It cannot be in the past’.

So in this way the Google approach allows the delay in drive time caused by traffic conditions to be quantified. Although this cannot be determined retrospectively, a speculative future departure time can be selected, for which a prediction is made based on previous traffic conditions.
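For completeness, the same call can be scripted rather than made through Postman. The sketch below uses the Python ‘requests’ package (pip install requests); the origin and destination are illustrative ‘lat,lng’ decimal-degree strings rather than the encoded values used earlier, and <API KEY> must be replaced with your own key:

import requests

params = {
    "units": "metric",
    "origins": "51.88,0.95",       # illustrative origin (lat,lng)
    "destinations": "51.89,1.03",  # illustrative destination (lat,lng)
    "departure_time": "now",       # cannot be in the past
    "traffic_model": "best_guess",
    "key": "<API KEY>",
}
response = requests.get(
    "https://maps.googleapis.com/maps/api/distancematrix/json", params=params)
data = response.json()
element = data["rows"][0]["elements"][0]
print(element["duration"]["value"], element["duration_in_traffic"]["value"])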


Utilities

The table used to calculate the time period for the Highways England data, described above:

Period From To
0 12:00:00 AM 12:15:00 AM
1 12:15:00 AM 12:30:00 AM
2 12:30:00 AM 12:45:00 AM
3 12:45:00 AM 1:00:00 AM
4 1:00:00 AM 1:15:00 AM
5 1:15:00 AM 1:30:00 AM
6 1:30:00 AM 1:45:00 AM
7 1:45:00 AM 2:00:00 AM
8 2:00:00 AM 2:15:00 AM
9 2:15:00 AM 2:30:00 AM
10 2:30:00 AM 2:45:00 AM
11 2:45:00 AM 3:00:00 AM
12 3:00:00 AM 3:15:00 AM
13 3:15:00 AM 3:30:00 AM
14 3:30:00 AM 3:45:00 AM
15 3:45:00 AM 4:00:00 AM
16 4:00:00 AM 4:15:00 AM
17 4:15:00 AM 4:30:00 AM
18 4:30:00 AM 4:45:00 AM
19 4:45:00 AM 5:00:00 AM
20 5:00:00 AM 5:15:00 AM
21 5:15:00 AM 5:30:00 AM
22 5:30:00 AM 5:45:00 AM
23 5:45:00 AM 6:00:00 AM
24 6:00:00 AM 6:15:00 AM
25 6:15:00 AM 6:30:00 AM
26 6:30:00 AM 6:45:00 AM
27 6:45:00 AM 7:00:00 AM
28 7:00:00 AM 7:15:00 AM
29 7:15:00 AM 7:30:00 AM
30 7:30:00 AM 7:45:00 AM
31 7:45:00 AM 8:00:00 AM
32 8:00:00 AM 8:15:00 AM
33 8:15:00 AM 8:30:00 AM
34 8:30:00 AM 8:45:00 AM
35 8:45:00 AM 9:00:00 AM
36 9:00:00 AM 9:15:00 AM
37 9:15:00 AM 9:30:00 AM
38 9:30:00 AM 9:45:00 AM
39 9:45:00 AM 10:00:00 AM
40 10:00:00 AM 10:15:00 AM
41 10:15:00 AM 10:30:00 AM
42 10:30:00 AM 10:45:00 AM
43 10:45:00 AM 11:00:00 AM
44 11:00:00 AM 11:15:00 AM
45 11:15:00 AM 11:30:00 AM
46 11:30:00 AM 11:45:00 AM
47 11:45:00 AM 12:00:00 PM
48 12:00:00 PM 12:15:00 PM
49 12:15:00 PM 12:30:00 PM
50 12:30:00 PM 12:45:00 PM
51 12:45:00 PM 1:00:00 PM
52 1:00:00 PM 1:15:00 PM
53 1:15:00 PM 1:30:00 PM
54 1:30:00 PM 1:45:00 PM
55 1:45:00 PM 2:00:00 PM
56 2:00:00 PM 2:15:00 PM
57 2:15:00 PM 2:30:00 PM
58 2:30:00 PM 2:45:00 PM
59 2:45:00 PM 3:00:00 PM
60 3:00:00 PM 3:15:00 PM
61 3:15:00 PM 3:30:00 PM
62 3:30:00 PM 3:45:00 PM
63 3:45:00 PM 4:00:00 PM
64 4:00:00 PM 4:15:00 PM
65 4:15:00 PM 4:30:00 PM
66 4:30:00 PM 4:45:00 PM
67 4:45:00 PM 5:00:00 PM
68 5:00:00 PM 5:15:00 PM
69 5:15:00 PM 5:30:00 PM
70 5:30:00 PM 5:45:00 PM
71 5:45:00 PM 6:00:00 PM
72 6:00:00 PM 6:15:00 PM
73 6:15:00 PM 6:30:00 PM
74 6:30:00 PM 6:45:00 PM
75 6:45:00 PM 7:00:00 PM
76 7:00:00 PM 7:15:00 PM
77 7:15:00 PM 7:30:00 PM
78 7:30:00 PM 7:45:00 PM
79 7:45:00 PM 8:00:00 PM
80 8:00:00 PM 8:15:00 PM
81 8:15:00 PM 8:30:00 PM
82 8:30:00 PM 8:45:00 PM
83 8:45:00 PM 9:00:00 PM
84 9:00:00 PM 9:15:00 PM
85 9:15:00 PM 9:30:00 PM
86 9:30:00 PM 9:45:00 PM
87 9:45:00 PM 10:00:00 PM
88 10:00:00 PM 10:15:00 PM
89 10:15:00 PM 10:30:00 PM
90 10:30:00 PM 10:45:00 PM
91 10:45:00 PM 11:00:00 PM
92 11:00:00 PM 11:15:00 PM
93 11:15:00 PM 11:30:00 PM
94 11:30:00 PM 11:45:00 PM
95 11:45:00 PM 12:00:00 AM

Connecting a Raspberry Pi to eduroam wifi

Raspberry Pi connected to eduroam wifi

Connecting to eduroam

The built-in wifi on board the new Raspberry Pi 3 doesn’t seem to connect to the eduroam wifi network here on campus out of the box. Clicking on the wifi icon in the top right of the Pi’s desktop shows the eduroam network as a greyed-out option. You should find that Cranfield Web and Cranfield Setup are both available to join, though.

Select Cranfield Web/Setup, open a browser and go to the web address http://cat.eduroam.org. You may be prompted for a username and password; your Cranfield University username and password should be supplied here.

Once at cat.eduroam.org, select Cranfield University from the list of institutions when prompted and then press the button titled ‘Linux’. This should download a small script which you should run. You can run it either by clicking the file once it has downloaded at the bottom of your browser, or by navigating to its location within your filesystem and running it from there.

If the script does not initially run (but simply opens in a text editor), you may need to add executable permissions to the downloaded script. Navigate to the location of your script in a terminal window and run the command:

sudo chmod u+x <filename>

When the script runs, it will prompt you with some dialogue boxes asking for your eduroam username and password. Your eduroam username should take the form <username>@cranfield.ac.uk (where <username> is your regular Cranfield username) and you should use your regular Cranfield password.

The script will eventually tell you that it was unable to update your network settings, however the important config we need will be stored in a .cat_installer folder in your home directory. Using a terminal window, navigate to this folder:

cd .cat_installer

 You will see a file cat_installer.conf. The contents of this file need to be copied and pasted into the file wpa_supplicant.conf which resides in /etc/wpa_supplicant/. Navigate to this folder:

cd /etc/wpa_supplicant

You will need to edit the wpa_supplicant.conf file using superuser permissions. You can edit this file using the text editor nano:

sudo nano wpa_supplicant.conf

Using your right mouse button (ctrl+v doesn’t work here), paste in the config that you copied earlier at the bottom of the file.

Save this file (in nano, press ctrl+O), then exit (ctrl+x).
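As an aside, if you would rather script the copy than paste it by hand, a minimal Python sketch along the following lines (run with sudo, and assuming the default ‘pi’ user and the file locations above) would append the generated config for you:

# append_eduroam.py - run with: sudo python3 append_eduroam.py
from pathlib import Path

src = Path("/home/pi/.cat_installer/cat_installer.conf")  # default 'pi' user assumed
dst = Path("/etc/wpa_supplicant/wpa_supplicant.conf")

config = src.read_text()
with dst.open("a") as f:
    f.write("\n" + config + "\n")
print("Appended", src, "to", dst)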

You will now need to restart the system:

sudo shutdown -r now

When the system boots back up to desktop again, you should now be connected to eduroam. Note, clicking on the wifi icon in the top right of the screen will show that you are connected to eduroam, but it will still be greyed out. This is fine.

Be aware that this configuration will have your password stored in plaintext in the wifi configuration file. It is, therefore, essential that access to your Pi is protected with a secure password, as anyone able to use the sudo command on the Pi will be able to read this file. If you are planning for several people to have access to a terminal/ssh on your Pi, it is recommended you connect the device to the network via a wired ethernet connection.


Apache Spark, Zeppelin and geospatial big data processing

There is much interest here at Cranfield University in the use of Big Data tools, and with our parallel interests in all things geospatial, the question arises – how can Big Data tools process geospatial data?

In this blog, we investigate the use of Apache Spark, Apache Zeppelin and a couple of geospatial libraries. In an earlier blog, we set up Spark and Zeppelin, and now we extend this to use these additional tools. Note that this exercise is undertaken with a MacBook, although the instructions should work with Linux just as well.

There are few geospatial libraries for Big Data processing that work with Spark/Hadoop. Some of those that exist include the Hadoop offering from ESRI, Magellan, and GeoSpark.

GeoSpark

To set up GeoSpark, we downloaded the library ‘geospark-0.3.2-spark-2.x.jar’ from https://github.com/DataSystemsLab/GeoSpark/releases and saved the file off locally, e.g. to

/Users/sparkuser/spark/jars/

Next, in the Apache Spark installation ‘conf’ folder, we copied the template file ‘spark-defaults.conf.template’ to ‘spark-defaults.conf’ ready for editing – we need to tell Spark to use the GeoSpark jar library.

Now, we edited the configuration file, adding a line at the end to reference the jar, e.g.

spark.jars /Users/sparkuser/spark/jars/geospark-0.3.2-spark-2.x.jar

Sourcing data

We need some spatial data for our test. We downloaded the sample data files ‘zcta510-small.csv‘ and ‘arealm-small.csv‘ (available online from the GeoSpark repository, as above) to a local data location, e.g. /Users/sparkuser/spark/data/geospark.

The datasets take the following form:
arealm-small.csv

-88.331492,32.324142
-88.175933,32.360763
-88.388954,32.357073
-88.221102,32.35078
-88.323995,32.950671
...

zcta510-small.csv

-155.940114,19.081331,-155.618917,19.5307
-155.335476,19.802474,-155.104434,19.93224
-155.85966,20.120695,-155.765027,20.268469
-155.396864,19.519641,-154.987674,19.800274
-155.98572,19.53958,-155.822977,19.70849
...
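Before moving on to the GeoSpark code, a quick plain-Python sanity check (ours, assuming the paths above) confirms the files are in place and shows how many rows each contains:

data_dir = "/Users/sparkuser/spark/data/geospark"
for name in ("arealm-small.csv", "zcta510-small.csv"):
    path = "{}/{}".format(data_dir, name)
    with open(path) as f:
        rows = f.readlines()
    print(name, len(rows), "rows; first row:", rows[0].strip())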

The code

We now followed exactly the GeoSpark example tutorial code, in the Scala language.
First, we need to ensure the correct libraries are loaded and available:

import org.datasyslab.geospark.spatialOperator.RangeQuery
import org.datasyslab.geospark.spatialRDD.PointRDD
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.RectangleRDD
import com.vividsolutions.jts.geom.Envelope
import org.datasyslab.geospark.spatialOperator.KNNQuery
import com.vividsolutions.jts.geom.Coordinate
import com.vividsolutions.jts.geom.GeometryFactory
import com.vividsolutions.jts.geom.Point

Now we can run the following code and observe the output:

// Start an example Spatial Range Query without Index
val queryEnvelope=new Envelope (-113.79,-109.73,32.99,35.08);
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at column 0 */
val resultSize = RangeQuery.SpatialRangeQuery(objectRDD, queryEnvelope, 0).getRawPointRDD().count(); /* The 0 means consider a point only if it is fully covered by the query window */


queryEnvelope: com.vividsolutions.jts.geom.Envelope = Env[-113.79 : -109.73, 32.99 : 35.08]
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@52b8d9a6
resultSize: Long = 445

// Start an example Spatial Range Query with Index
val queryEnvelope=new Envelope (-113.79,-109.73,32.99,35.08);
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at column 0 */
objectRDD.buildIndex("rtree"); /* Build R-Tree index */
val resultSize = RangeQuery.SpatialRangeQueryUsingIndex(objectRDD, queryEnvelope,0).getRawPointRDD().count(); /* The 0 means consider a point only if it is fully covered by the query window */

queryEnvelope: com.vividsolutions.jts.geom.Envelope = Env[-113.79 : -109.73, 32.99 : 35.08]
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@2c3e8ebf
resultSize: Long = 445

// Start an example Spatial KNN Query without Index
val fact=new GeometryFactory();
val queryPoint=fact.createPoint(new Coordinate(-109.73, 35.08));
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at column 0 */
val resultSize = KNNQuery.SpatialKnnQuery(objectRDD, queryPoint, 5); /* The number 5 means 5 nearest neighbors */

fact: com.vividsolutions.jts.geom.GeometryFactory = com.vividsolutions.jts.geom.GeometryFactory@35f6b599
queryPoint: com.vividsolutions.jts.geom.Point = POINT (-109.73 35.08)
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@76d6439b
resultSize: java.util.List[com.vividsolutions.jts.geom.Point] = [POINT (-109.538914 35.123446), POINT (-108.729849 37.196678), POINT (-117.105253 33.48551), POINT (-120.679839 35.25764), POINT (-120.860368 35.398047)]

// Start an example Spatial KNN Query with Index
val fact=new GeometryFactory();
val queryPoint=fact.createPoint(new Coordinate(-109.73, 35.08));
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at column 0 */
objectRDD.buildIndex("rtree"); /* Build R-Tree index */
val resultSize = KNNQuery.SpatialKnnQueryUsingIndex(objectRDD, queryPoint, 5); /* The number 5 means 5 nearest neighbors */

fact: com.vividsolutions.jts.geom.GeometryFactory = com.vividsolutions.jts.geom.GeometryFactory@24046396
queryPoint: com.vividsolutions.jts.geom.Point = POINT (-109.73 35.08)
objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@6db7719d
resultSize: java.util.List[com.vividsolutions.jts.geom.Point] = [POINT (-109.538914 35.123446), POINT (-108.729849 37.196678), POINT (-108.135158 37.242491), POINT (-107.596572 37.000003), POINT (-107.79524 37.225479)]

// Start an example Spatial Join Query without Index
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0 ,"csv","rtree",4); /* The 0 means the spatial attribute starts at column 0, the 4 means 4 RDD partitions, and "rtree" means use the R-Tree spatial partitioning grid */
val rectangleRDD = new RectangleRDD(sc, "/Users/sparkuser/spark/data/geospark/zcta510-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at column 0 */
val joinQuery = new JoinQuery(sc,objectRDD,rectangleRDD);
val resultSize = joinQuery.SpatialJoinQuery(objectRDD,rectangleRDD).count();
objectRDD.totalNumberOfRecords  /* see https://github.com/DataSystemsLab/GeoSpark/blob/master/src/main/java/org/datasyslab/geospark/spatialRDD/PointRDD.java for API */

objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@730e3723
rectangleRDD: org.datasyslab.geospark.spatialRDD.RectangleRDD = org.datasyslab.geospark.spatialRDD.RectangleRDD@2bf31c8c
joinQuery: org.datasyslab.geospark.spatialOperator.JoinQuery = org.datasyslab.geospark.spatialOperator.JoinQuery@36cecee7
resultSize: Long = 9989

// Start an example Spatial Join Query with Index
val objectRDD = new PointRDD(sc, "/Users/sparkuser/spark/data/geospark/arealm-small.csv", 0 ,"csv","rtree",4); /* The 0 means the spatial attribute starts at column 0, the 4 means 4 RDD partitions, and "rtree" means use the R-Tree spatial partitioning grid */
val rectangleRDD = new RectangleRDD(sc, "/Users/sparkuser/spark/data/geospark/zcta510-small.csv", 0, "csv"); /* The 0 means the spatial attribute starts at column 0 */
val joinQuery = new JoinQuery(sc,objectRDD,rectangleRDD);
objectRDD.buildIndex("rtree"); /* Build R-Tree index */
val resultSize = joinQuery.SpatialJoinQueryUsingIndex(objectRDD,rectangleRDD).count();

objectRDD: org.datasyslab.geospark.spatialRDD.PointRDD = org.datasyslab.geospark.spatialRDD.PointRDD@1301fbdd
rectangleRDD: org.datasyslab.geospark.spatialRDD.RectangleRDD = org.datasyslab.geospark.spatialRDD.RectangleRDD@ebfb5e7
joinQuery: org.datasyslab.geospark.spatialOperator.JoinQuery = org.datasyslab.geospark.spatialOperator.JoinQuery@197ff4a6
resultSize: Long = 9989


Apache Spark and Zeppelin – Big Data Tools

Cranfield University students and staff recently joined other members of the DREAM Centre for Doctoral Training in Big Data, on the excellent ‘Winter School’ in Big Data at the Hartree Centre, the UK’s pre-eminent centre for Big Data technology. We were able to explore the impressive capability of the Apache Spark environment on the Hartree’s IBM compute cluster.

Learning Apache Spark™ offers a useful insight into Big Data processing, and the opportunities available in handling data at scale. Spark is a fast and general engine for large-scale data processing, and has emerged as the software ‘ecosystem’ of choice for contemporary Big Data processing. Its huge advantage over earlier Big Data tool approaches is that it keeps data in memory between successive operations, avoiding the cost of repeated disk reads and writes; as a consequence it is very quick. Spark has four key modules that allow powerful, complementary data processing: ‘SQL and DataFrames’, ‘Spark Streaming’, ‘MLlib’ (machine learning) and ‘GraphX’ (graph).

The good news is that one can learn Spark in a number of ways, all at no cost. Most of the big cloud providers offering Spark have ‘community accounts’ where one can register a free account in order to learn (e.g. IBM Data Science Experience, Databricks and MS Azure, to name a few). However, Spark can also be installed locally on a laptop which, if it has a multi-core processor, can then do some parallel processing of a sort: certainly enough for our learning purposes. It is therefore the installation of a local Big Data Spark environment on a MacBook laptop that forms the basis for this post (this will all work on Linux too).

In addition to Spark, this post also allows us to explore the use of the Apache Zeppelin™ notebook environment. Notebooks are a fantastic way to keep a record of projects, with processing code and contextual information all kept in one document. For this exercise, we undertook the following steps:

Load up some sample CSV data

As a very first step, we wanted to download some sample data onto the local disk that could be representative of ‘Big Data’. The CSV format (Comma Separated Values) is widely used as a means of working with large datasets, so we will use this. The Apache Foundation themselves have a number of example files – so we will use one of them – ‘bank.csv’. To pull a file in locally, use the ‘curl‘ command, thus:

curl "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv" -o "bank.csv"

On other systems, the ‘wget‘ command can be used instead (e.g. on Linux). After this we have a file ready for later use.
CSV, Comma Separated Values file
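As an alternative to curl or wget, the same download can be done from Python; the short sketch below (ours) fetches the file and prints its first couple of lines as a quick check:

import urllib.request

url = "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"
urllib.request.urlretrieve(url, "bank.csv")  # save alongside the script

with open("bank.csv") as f:
    for _ in range(2):
        print(f.readline().rstrip())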

Installing Spark

Next, we need to install Spark itself. The steps are as follows:
1. Go to http://spark.apache.org and select ‘Download Spark’.
2. We left the version ‘drop down’ at the latest (default): for us this was v2.0.2
3. We downloaded the resultant file ‘spark-2.0.2-bin-hadoop2.7.tgz’.
4. We created a new folder ‘spark’ in our user home directory, and opening a terminal window, we unpacked the file thus:
tar -xvf spark-2.0.2-bin-hadoop2.7.tgz.
5. After this we checked the files are all present in /Users/geothread/spark/spark-2.0.2-bin-hadoop2.7.
Next, the configuration needs checking. In the terminal, move to the Spark conf folder:
cd /Users/geothread/spark/spark-2.0.2-bin-hadoop2.7/conf.
6. Templates. Note that in the conf folder there are a number of files which end *.template (e.g. ‘spark-defaults.conf.template’). These template files are provided for you to edit as required. If you need to do this, you copy the template file, removing the suffix, then edit as required (e.g. cp spark-defaults.conf.template spark-defaults.conf). In fact, we will leave these default settings as they are for now in our local installation.
7. Running Spark. To run Spark, in the terminal, move to the bin folder. We will start off by running Scala. Scala is the programming language that Spark is mostly written in, and it can also be run at the command line. In running Scala, we can note how the Spark context ‘sc’ is made available for use (the Spark context is the ‘instance’ of Spark that is running):

bin$> ls
beeline pyspark2.cmd spark-shell2.cmd
beeline.cmd run-example spark-sql
derby.log run-example.cmd spark-submit
load-spark-env.cmd spark-class spark-submit.cmd
load-spark-env.sh spark-class.cmd spark-submit2.cmd
metastore_db spark-class2.cmd sparkR
pyspark spark-shell sparkR.cmd
pyspark.cmd spark-shell.cmd sparkR2.cmd

bin$> ./spark-shell
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@8077c97
scala> System.exit(1)

Now we will switch to Python instead. We will run pyspark, the API designed to expose Spark to Python, and load and do a line count of the sample CSV data downloaded earlier. Note that the Spark context sc is again made available:

bin$> ./pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Python version 2.7.10 (default, Jul 30 2016 18:31:42)
SparkSession available as 'spark'.
>>> sc
<pyspark.context.SparkContext object at 0x10b021f10>
>>> df = spark.read.csv("/Users/geothread/bank.csv", header=True, mode="DROPMALFORMED")
>>> df.count()
4521
>>> exit()
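The same steps can be run as a standalone script rather than interactively. The sketch below (ours; the script name is arbitrary) adds printSchema() and show() for a first look at the data, and can be submitted with ./bin/spark-submit bank_count.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BankCSV").getOrCreate()
df = spark.read.csv("/Users/geothread/bank.csv", header=True, mode="DROPMALFORMED")
df.printSchema()   # column names and types (all strings unless inferSchema=True is set)
df.show(5)         # first five rows
print(df.count())  # 4521 rows expected
spark.stop()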

Monitoring jobs

We can also go and check up on the Spark jobs we ran by accessing the web dashboard built into Spark. It runs by default on port 4040, so note this number must be added to the URL after the colon, thus:

http://localhost:4040/jobs/
Spark dashboard
Hopefully this all works OK and the dashboard can be accessed. The next step is to install and configure the Apache Zeppelin notebook.
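As a final aside on monitoring before moving on, the dashboard also exposes a REST monitoring API (in recent Spark versions, under /api/v1 on the same port), which can be queried from Python while a spark-shell or pyspark session is still open. A minimal sketch:

import json
import urllib.request

# List the running Spark applications via the monitoring REST API
with urllib.request.urlopen("http://localhost:4040/api/v1/applications") as r:
    apps = json.load(r)

for app in apps:
    print(app["id"], app["name"])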

Installing Zeppelin

Apache Zeppelin offers a web-based notebook enabling interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. Zeppelin installs and runs as a server – so there are some fiddly bits to getting it going. Under the bonnet, notebooks are saved off in JSON format – but we don’t really need to know this just to use it.
Apache Zeppelin
To obtain and run the Zeppelin notebook, use following steps:
1. Go to https://zeppelin.apache.org and ‘Get Download’.
2. Save off and unpack the file to a new folder created in your home folder, e.g. ‘Users/geothread/zeppelin’.
tar -xvf zeppelin-0.6.2-bin-all.tar
3. Go to the conf folder
cd conf
As before, note the template files; look at the file ‘zeppelin-site.xml.template’ – Zeppelin will run on port 8080 by default. If you need to change this (and we did – we needed it to use port 9080 instead), you can make a copy of this file.
cp zeppelin-site.xml.template zeppelin-site.xml
4. Edit the new file with your favourite text editor, (e.g. with nano), to change the port as required.
5. Also in this file, if you are running the Zeppelin server locally, you can edit the server address to ‘localhost’. When we had finished editing for our server, the file was as follows (in summary, the two edits were to add ‘localhost’ and ‘9080’):

<property>
<name>zeppelin.server.addr</name>
<value>localhost</value>
<description>Server address</description>
</property>

<property>
<name>zeppelin.server.port</name>
<value>9080</value>
<description>Server port.</description>
</property>

6. At this point we ensured the following lines were in the account’s .profile configuration file in the home folder. Note that, as an alternative, these settings can also be added locally to the configuration files in the Zeppelin conf folder.

export JAVA_HOME=$(/usr/libexec/java_home)
export SPARK_HOME="$HOME/spark/spark-2.0.2-bin-hadoop2.7"

7. The next step may or may not be necessary – it was for us. In fact we know this was necessary for us, as we got an error message just like one others have described online.
8. Go to https://mvnrepository.com/artifact/com.fasterxml.jackson and download the files: ‘jackson-core-2.6.5.jar‘, ‘jackson-annotations-2.6.5.jar‘ and ‘jackson-databind-2.6.5.jar‘. Note these are not the latest files available! The latest jackson file version didn’t work for us – but v2.6.5 worked fine.
9. Go to the lib folder and remove (best to just move them somewhere else, e.g. the downloads folder) the files ‘jackson-core-2.5.0.jar‘, ‘jackson-annotations-2.5.0.jar‘ and ‘jackson-databind-2.5.0.jar‘.
10. Copy the three new downloaded v2.6.5 version files into the lib folder.
11. Now go to the bin folder and start the server (before using Zeppelin at any time, you will need to ensure the server is running):
./zeppelin-daemon.sh start
12. Note you can also stop and restart the daemon at any time, like this:
./zeppelin-daemon.sh restart and ./zeppelin-daemon.sh stop
You may find it useful to add some shortcuts to your .profile file to save time, for example (with each command being all on one line, and using the correct path of course):
alias zeppelin_start='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh start'
alias zeppelin_stop='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh stop'
alias zeppelin_restart='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh restart'

13. Next, open a browser and open the zeppelin notebook home page:
http://localhost:9080 (or whatever the port number is for you), and hopefully you are off, up and running.
Apache Zeppelin
14. Try running the sample notebooks provided. There are many online tutorials for Spark available – such as the excellent one here. So there is no need to reinvent the wheel repeating all that here on GeoThread. However, there are not so many tutorials showing how to integrate geospatial data into Big Data operations, an interest for us – so we hope a future blog will look at that.

If you want to know more about Zeppelin and get a walk through of its many features, watch this video by Moon soo Lee, the understated genius who created Zeppelin, speaking in Amsterdam at Spark Summit Europe 2015.


Making the Raspberry Pi work, the next steps – Pi Society

There is a lot of interest in the amazing Raspberry Pi 3 computer here at Cranfield University. Once you have the basics for the Raspberry Pi in place, there are a few tips and tricks you can follow to make the Pi work better for you, explained below. The assumption here is that you already have the Pi connected up to a monitor, have a keyboard and mouse plugged in, and are running Raspbian Jessie (if not, see our earlier Pi tutorials). The topics covered here are:
Setting a system password | Getting WiFi running | Accessing the Pi from another computer | Moving files to and from the Pi from another computer – scp | Installing and updating software on the Pi | Setting aliases | Learning Python

Setting a system password

The default account on the Pi is in fact username ‘pi’, and there is no password for the account by default. It is good practice to set a password, particularly once you activate WiFi for the Pi. To do this, Select Menu > Preferences > Raspberry Pi Configuration > System. Select the password option and enter in a secure password.

While you are at this dialogue box, you can explore the other options there – for example, under ‘Localisation’ you can set the Pi’s timezone appropriately.

Getting WiFi running

WiFi on the Pi is very straightforward if you have a home router running WPA security. At the graphical interface, in the top right corner of the screen is the WiFi icon. Right-click the WiFi icon and select the first option, ‘WiFi Networks Settings’. In the Network Settings dialogue, in the ‘configure’ drop-down, select ‘SSID’ and search for your router, select it and press OK. If a WiFi password is required, select the WiFi icon again, locate the connection you just made in the list and double click – a dialogue appears for you to enter the password. Once done, hopefully the connection is made and you will be online.

To get the Pi to connect wirelessly to the campus WiFi network ‘EduRoam’ is rather more tricky – see https://www.raspberrypi.org/forums/viewtopic.php?f=28&t=86253
Using the Jessie graphical user interface of the Pi, you should be able to follow the same procedure to get the Pi online with Eduroam. If you intend to do this however, you must ensure that users all have passwords set, and that the hostname is something other than ‘pi’!

Accessing the Pi from another computer

To start with, you have to connect a monitor and keyboard/mouse to the Pi in order to use it. However, ultimately you may want to have it running on its own, and to be able to connect to it from a different computer – on the command line, or graphically.

Assuming the Pi is using WiFi and is online, find out what its IP address is. To do this, you can either consult the WiFi router’s web console to look for its connected clients, or you can type:
sudo ifconfig
In a home WiFi situation, the IP address will likely be something like ‘192.168.1.nn’ (e.g. 192.168.1.11).

From your other computer (e.g. a PC or Mac), you can now run an ‘ssh‘ (secure shell) client. On a PC you might use the excellent freeware tool ‘putty’; on a Mac you can fire up the terminal and enter ‘ssh’ at the prompt. Make the connection by incorporating the Pi user name (which by default is ‘pi’) into the IP address; thus, in a Mac terminal window, for the user name ‘pi’ you would type:
ssh pi@192.168.1.11

Sometimes accessing via the command line is just not enough, and you may want to run an app with a graphical interface from another computer. If you use the secure shell (ssh) to log in to the Pi, you may be able to carry the ‘X-windows’ session over to the computer using the ‘-X’ parameter, thus:

ssh -X pi@192.168.1.11

On a Mac running Sierra, you may have to use ‘-Y’ instead (assuming you have XQuartz installed):

ssh -Y pi@192.168.1.11

On Windows computers, you can install and use Xming for the X-windows, and then use putty as noted above for ssh sessions. Putty has a dialogue tick box option to carry X11 over to the local session.

To test this works, install the standard set of X11 apps, then run say xclock, thus:

sudo apt-get install x11-apps
xclock

All the above can be a bit fiddly – and a much easier method is to use a VNC (Virtual Network Computing) server. In former times to do this you would install a software package called ‘tightvncserver’ – and there are lots of instructions out on the web to do this – and you can still do this if you wish. However, as from Sept 2016, the Raspberry Pi 3 now includes a VNC server by default, from RealVNC – (see https://www.realvnc.com/docs/raspberry-pi.html). Although installed it is not activated by default. To activate it, select Menu > Preferences > Raspberry Pi Configuration > Interfaces. Ensure the VNC tick box is Enabled and then reboot the Pi. From now on, VNC Server will start automatically whenever your Pi is powered on.

Now, on the computer you want to connect to the Pi, you will also need to install the RealVNC viewer. Visit ‘https://www.realvnc.com/download/viewer/‘, download and install the appropriate viewer. Note you will need to know and enter the IP address of the Pi, as well as your account and password.

Moving files to and from the Pi from another computer – scp

Very often you will want to copy files to and from the Pi – here’s how:

If you have the ssh tools running as above, you can just as easily use the ‘secure copy’ command ‘scp’ to move files to and from the Pi from your controlling computer, as described at https://www.raspberrypi.org/documentation/remote-access/ssh/scp.md.

Copying a file from your computer TO the Pi

scp myfile.txt pi@192.168.1.11:

or to copy the file to a specific folder:
scp myfile.txt pi@192.168.1.11:myfolder/
This copies the file to the /home/pi/myfolder/ directory on your Raspberry Pi.

Copying a file from your Pi TO the computer

scp pi@192.168.1.11:myfile.txt .

Installing and updating software on the Pi

You will need to keep the software installed on the Pi up to date. Being open source, it seems that there are constant updates that need installing – so this is a procedure that should be run through on a regular basis. The tool needed to do this is called ‘apt’.

Open a terminal and enter (‘apt-get’ is all one word, with no spaces):
sudo apt-get update
This refreshes the list of available packages and may take a few moments – don’t interrupt it until it is finished. To then install any available updates, follow it with:
sudo apt-get upgrade

This command line tool is very powerful and can be used to install software as well as keep it updated. For example, to install the package ‘curl’, you type:
sudo apt-get install curl

P.S. By the way, curl is a useful command-line tool to transfer data from or to a server, using one of the supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET and TFTP).

If you prefer a package manager tool with a menu-driven interface (it runs within the terminal), you can try ‘aptitude’:
sudo aptitude

Setting aliases

Using the Pi, you quickly realise that the command line, using the ‘terminal’, offers you powerful and efficient command tools for configuring and running the computer. Did you know you can set up shortcuts to help make the use of these commands even more efficient? An example we shall use here is to set an alias of ‘ll’ to run the full command ‘ls -al’ – a little shortcut that speeds up making full file listings on the screen – so that then, every time you type ‘ll’, the command ‘ls -al’ gets run.

Each time you start a terminal session, a configuration file in your home folder named ‘.bashrc’ is read – and shortcuts can be placed into this file. If you have a look in this file (it is a plain text file) you can see that there is a line in it making a call to incorporate any aliases defined in a companion file called ‘.bash_aliases’. Note this latter file doesn’t exist by default, so enter the following:

cd
nano .bash_aliases

This starts up the text editor ‘nano’ (any text editor would do the job as well!) and creates the file – now enter a new line in it:
alias ll='ls -al'

Save the file out and exit (in nano this is Ctrl+O, Ctrl+X)

Now close the terminal window (type ‘exit’). Next time you run a terminal session, type ‘ll’ to see the full file listing – try it and see.

Be sure to visit https://www.raspberrypi.org/ to learn more about these amazing machines.

Learning Python

The Pi offers a fantastic means to learn programming – using languages such as Python. There are so many Python tutorials out there that we don’t write our own here. However, here are a few guidelines to get you up and running on the excellent quick Python tutorial from Magnus Lie Hetland. This is a good starting point for anyone who has done a spot of coding before as it is short, sweet and to the point!

The tutorial is online here

Firstly you need a decent editor (did you know there were so many?). Sadly, our favourite editor PyCharm has not yet been ported to the Pi.

At the command line you can use ‘nano’:
nano myfile.py

Better still use a graphical editor – assuming it is installed, use say ‘Leafpad’
(the ‘&’ symbol means run editor as a background process):
leafpad myfile.py &

Best of all, use the Idle3 editor on the Pi:
idle3 myfile.py &

Secondly, run the code with the python command as below, (or in Idle – just press ‘F5’ to run the source code).
python myfile.py
Then you can follow Magnus’ examples through! For example, copy and run this code:

# Exercise 2
# Part 1:
print("Part 1\nThis program continually reads in numbers\nand adds them together until the sum reaches 100:")
total = 0
goes = 0
while total < 100:
    # Get the user's choice:
    number = int(input("enter a number > "))
    total += number
    goes += 1
print("The sum is now", total, "and it took you", goes, "turns.")

# Part 2:
print("\nPart 2\nThis program receives a number of values\nand prints the sum")
total = 0
number = int(input("enter the number of values to enter > "))
for counter in range(0, number):
    # Get the user's choice:
    number = int(input("enter a number > "))
    total += number
print("The total was", total)


Raspberry Pi 3: Operating LEDs – Pi Society

This is a tutorial about controlling external components from our Raspberry Pi 3 computer – turning on and off LEDs and sounding a buzzer, using the Python language, as part of the Cranfield University “Pi Society”. These materials were purchased in the UK from the Pi Hut (http://thepihut.com). The components used are part of the excellent Cam Jam suite (http://camjam.me) – a great site where you can download many worksheets.

In this third tutorial, we set up a Raspberry Pi 3 computer, then get it to turn on and off some LEDs and control a buzzer.
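To give a flavour of the sort of code involved, the sketch below uses the gpiozero Python library (pre-installed on recent Raspbian images) to blink an LED and sound a buzzer. Note the GPIO pin numbers are assumptions for illustration only – they must match how your own components are wired (the CamJam worksheets give the wiring used there):

from time import sleep
from gpiozero import LED, Buzzer

red = LED(18)        # pin number assumed - change to match your wiring
buzzer = Buzzer(22)  # pin number assumed - change to match your wiring

for _ in range(5):
    red.on()         # LED on
    sleep(0.5)
    red.off()        # LED off
    sleep(0.5)

buzzer.on()          # sound the buzzer for a second
sleep(1)
buzzer.off()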
