Tag Archives: Apache

Matters relating to the Apache open source software project and its software titles – eg. the popular web server

Apache Spark and Zeppelin – Big Data Tools

Cranfield University students and staff recently joined other members of the DREAM Centre for Doctoral Training in Big Data, on the excellent ‘Winter School’ in Big Data at the Hartree Centre, the UK’s pre-eminent centre for Big Data technology. We were able to explore the impressive capability of the Apache Spark environment on the Hartree’s IBM compute cluster.

Learning Apache Spark™ offers a useful insight into Big Data processing, and the opportunities available for handling data at scale. Spark is a fast and general engine for large-scale data processing, and has emerged as the software ‘ecosystem’ of choice for contemporary Big Data processing. Its huge advantage over earlier Big Data tools is that it can hold intermediate results in memory between processing steps, avoiding the cost of successive disk operations; as a consequence it is very quick. Spark has four key modules that allow powerful but complementary data processing: ‘SQL and DataFrames’, ‘Spark Streaming’, ‘MLlib’ (machine learning) and ‘GraphX’ (graph).

The good news is that one can learn Spark in a number of ways, all at no cost. Most of the big cloud providers who provide Spark offer ‘community accounts’ where one can register a free account in order to learn (e.g. IBM Data Science Experience, Databricks and MS Azure, to name a few). However, Spark can also be installed locally on a laptop which, if it has a multi-core processor, can then do some parallel processing of a sort: certainly enough for our learning purposes. It is therefore the installation of a local Big Data Spark environment on a MacBook laptop that forms the basis for this post (this should all work on Linux too).

In addition to Spark, this post also allows us to explore the use of the Apache Zeppelin™ notebook environment. Notebooks are a fantastic way to keep a record of projects, with processing code and contextual information all kept in one document. For this exercise we undertook the following steps:

Load up some sample CSV data

As a very first step, we wanted to download some sample data onto the local disk that could be representative of ‘Big Data’. The CSV format (Comma Separated Values) is widely used as a means of working with large datasets, so we will use this. The Apache Foundation themselves have a number of example files – so we will use one of them – ‘bank.csv’. To pull a file in locally, use the ‘curl‘ command, thus:

curl "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv" -o "bank.csv"

On other systems (e.g. Linux), the ‘wget‘ command can be used instead. After this we have a file ready for later use.
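For readers who would rather stay in Python, the same download can be sketched with the standard library alone. The function name `download` is our own for illustration; the URL is the one from the curl command above.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the curl command above.
BANK_CSV_URL = "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"


def download(url, destination):
    """Save the resource at `url` to `destination` and return its size in bytes."""
    destination = Path(destination)
    # Stream the response straight to disk rather than holding it all in memory.
    with urllib.request.urlopen(url) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out)
    return destination.stat().st_size
```

Calling `download(BANK_CSV_URL, "bank.csv")` should leave the file in the current folder, just as curl does.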

Installing Spark

Next, we need to install Spark itself. The steps are as follows:
1. Go to http://spark.apache.org and select ‘Download Spark’.
2. We left the version ‘drop down’ at the latest (default) release: for us this was v2.0.2.
3. We downloaded the resultant file ‘spark-2.0.2-bin-hadoop2.7.tgz’.
4. We created a new folder ‘spark’ in our user home directory and, opening a terminal window, unpacked the file thus:
tar -xvf spark-2.0.2-bin-hadoop2.7.tgz
5. After this we checked the files were all present in /Users/geothread/spark/spark-2.0.2-bin-hadoop2.7.
Next, the configuration needs checking. In the terminal, move to the Spark conf folder:
cd /Users/geothread/spark/spark-2.0.2-bin-hadoop2.7/conf
6. Templates. Note in the conf folder there are a number of files ending in ‘.template’ (e.g. ‘spark-defaults.conf.template’). These template files are provided for you to edit as required. If you need to do this, copy the template file to a new name without the ‘.template’ suffix, then edit the copy (e.g. cp spark-defaults.conf.template spark-defaults.conf). In fact, we will leave the default settings as they are for now in our local installation.
7. Running Spark. To run Spark, in the terminal, move to the bin folder. We will start off by running the Scala shell, ‘spark-shell’. Scala is the programming language that Spark is mostly written in, and it can also be used interactively at the command line. On starting the shell, note how the Spark context ‘sc’ is made available for use (the Spark context is the ‘instance’ of Spark that is running):

bin$> ls
beeline pyspark2.cmd spark-shell2.cmd
beeline.cmd run-example spark-sql
derby.log run-example.cmd spark-submit
load-spark-env.cmd spark-class spark-submit.cmd
load-spark-env.sh spark-class.cmd spark-submit2.cmd
metastore_db spark-class2.cmd sparkR
pyspark spark-shell sparkR.cmd
pyspark.cmd spark-shell.cmd sparkR2.cmd

bin$> ./spark-shell
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@8077c97
scala> System.exit(1)

Now instead we will switch to Python, using ‘pyspark’, the shell for Spark’s Python API. With it we can load the sample CSV data downloaded earlier and count its rows. Note, the Spark context sc is again made available:

bin$> ./pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Python version 2.7.10 (default, Jul 30 2016 18:31:42)
SparkSession available as 'spark'.
>>> sc
<pyspark.context.SparkContext object at 0x10b021f10>
>>> df = spark.read.csv("/Users/geothread/bank.csv", header=True, mode="DROPMALFORMED")
>>> df.count()
4521
>>> exit()
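
As a cross-check on the Spark row count, the file can also be counted without Spark at all. The sketch below uses only Python’s standard csv module; the function name `count_data_rows` is our own, and it assumes the first line of the file is a header. Note that Spark’s DROPMALFORMED mode may discard rows that this simple count would keep.

```python
import csv


def count_data_rows(path, delimiter=","):
    """Count the non-empty records in a CSV file, excluding the header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        next(reader, None)  # skip the header line, if there is one
        # csv.reader yields an empty list for a blank line, so filter those out.
        return sum(1 for row in reader if row)
```

For example, `count_data_rows("/Users/geothread/bank.csv")` can be compared with the figure reported by df.count().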

Monitoring jobs

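We can also check up on the Spark jobs we ran by accessing the web dashboard built into Spark. The dashboard is only served while a Spark shell or application is running, so leave the shell open in one terminal while you look. It runs by default on ‘port‘ 4040, so note this number must be added to the URL after the colon, thus: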

http://localhost:4040/jobs/
Hopefully this all works OK and the dashboard can be accessed. The next step is to install and configure the Apache Zeppelin notebook.
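
If the dashboard page does not load, a quick way to tell whether anything is listening on the port is a plain TCP connection test. This is a minimal sketch using Python’s socket module; the helper name `port_is_open` is our own.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

With a Spark shell running, `port_is_open("localhost", 4040)` would be expected to return True; the same check works later for whichever port Zeppelin is given.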

Installing Zeppelin

Apache Zeppelin offers a web-based notebook enabling interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. Zeppelin installs and runs as a server – so there are some fiddly bits to getting it going. Under the bonnet, notebooks are saved off in JSON format – but we don’t really need to know this just to use it.
To obtain and run the Zeppelin notebook, use the following steps:
1. Go to https://zeppelin.apache.org and ‘Get Download’.
2. Save off and unpack the file to a new folder created in your home folder, e.g. ‘/Users/geothread/zeppelin’.
tar -xvf zeppelin-0.6.2-bin-all.tar
3. Go to the conf folder
cd conf
As before, note the template files. Look at the file ‘zeppelin-site.xml.template’: Zeppelin will run on port 8080 by default. If you need to change this (and we did, as we needed it to use port 9080 instead), make a copy of this file without the ‘.template’ suffix:
cp zeppelin-site.xml.template zeppelin-site.xml
4. Edit the new file with your favourite text editor (e.g. nano) to change the port as required.
5. Also in this file, if you are running the zeppelin server locally then you can also edit the server IP to ‘localhost’. When we’d finished editing for our server, the file was as follows (in summary the two edits were to add ‘localhost’ and ‘9080’):

<property>
<name>zeppelin.server.addr</name>
<value>localhost</value>
<description>Server address</description>
</property>

<property>
<name>zeppelin.server.port</name>
<value>9080</value>
<description>Server port.</description>
</property>
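
To confirm the edits took effect, the name/value pairs can be read back with Python’s standard XML parser. This is a sketch only: the function name `read_properties` is our own, and it assumes the property blocks sit inside a single root element (shown as `<configuration>` in the usage note), as is usual for this style of configuration file.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property>'s <name> to its <value> in the given XML text."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is recorded as an empty string.
            settings[name.strip()] = (prop.findtext("value") or "").strip()
    return settings
```

Reading the contents of zeppelin-site.xml into a string and passing it to `read_properties` should give a dictionary in which ‘zeppelin.server.addr’ is ‘localhost’ and ‘zeppelin.server.port’ is ‘9080’, if the file matches the summary above.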

6. At this point we ensured the following lines were in the account .profile configuration file in the home folder. As an alternative, these settings can also be added locally in the configuration files in the zeppelin conf folder.

export JAVA_HOME=$(/usr/libexec/java_home)
export SPARK_HOME="$HOME/spark/spark-2.0.2-bin-hadoop2.7"
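
A small check can confirm that both variables are set and point at real folders before Zeppelin is started. The sketch below is our own helper (`missing_homes` is a hypothetical name), using only the standard library.

```python
import os


def missing_homes(env=None, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point at a directory."""
    env = os.environ if env is None else env
    # os.path.isdir("") is False, so an unset variable is reported as missing.
    return [name for name in names if not os.path.isdir(env.get(name, ""))]
```

Run from a fresh terminal (so that .profile has been read), `missing_homes()` should return an empty list if both settings are in place.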

7. The next step may or may not be necessary – it was for us. In fact we know this was necessary for us as we got the error message just like the one described online here:
8. Go to https://mvnrepository.com/artifact/com.fasterxml.jackson and download the files: ‘jackson-core-2.6.5.jar‘, ‘jackson-annotations-2.6.5.jar‘ and ‘jackson-databind-2.6.5.jar‘. Note these are not the latest files available! The latest jackson file version didn’t work for us – but v2.6.5 worked fine.
9. Go to the lib folder and remove the files ‘jackson-core-2.5.0.jar‘, ‘jackson-annotations-2.5.0.jar‘ and ‘jackson-databind-2.5.0.jar‘ (best to just move them somewhere else, e.g. the downloads folder).
10. Copy the three new downloaded v2.6.5 version files into the lib folder.
11. Now go to the bin folder and start the server (before using Zeppelin at any time, you will need to ensure the server is running):
./zeppelin-daemon.sh start
12. Note you can also stop and restart the daemon at any time, like this:
./zeppelin-daemon.sh restart and ./zeppelin-daemon.sh stop
You may find it useful to add some shortcuts to your .profile file to save time, for example (with each command being all on one line, and using the correct path of course):
alias zeppelin_start='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh start'
alias zeppelin_stop='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh stop'
alias zeppelin_restart='$HOME/zeppelin/zeppelin-0.6.2-bin-all/bin/zeppelin-daemon.sh restart'

13. Next, open a browser and open the zeppelin notebook home page:
http://localhost:9080 (or whatever the port number is for you), and hopefully you are off, up and running.
14. Try running the sample notebooks provided. There are many online tutorials for Spark available – such as the excellent one here. So there is no need to reinvent the wheel repeating all that here on GeoThread. However, there are not so many tutorials showing how to integrate geospatial data into Big Data operations, an interest for us – so we hope a future blog will look at that.

If you want to know more about Zeppelin and get a walk through of its many features, watch this video by Moon soo Lee, the understated genius who created Zeppelin, speaking in Amsterdam at Spark Summit Europe 2015.

Cookbook – Building a LAMP Server on the Raspberry Pi computer

Purpose: In this project we wanted to use one of the amazing Raspberry Pi computers (http://www.raspberrypi.org/) to build a fully functional ‘LAMP’ environment – that is ‘Linux’ ‘Apache’ ‘MySQL’ and ‘PHP’. On this Cranfield University site, we describe how we did it. Only the basic commands are placed here, see the links to other websites for extra explanations and post-install configurations etc.

Obtaining a Raspberry Pi

We suggest buying a Raspberry Pi with a case and power supply, as well as an SDHC card with the Raspbian OS. However, for experimentation purposes, you may also wish to buy another 4GB SDHC card (for compatible types, see http://elinux.org/RPi_VerifiedPeripherals before purchase). Then you can simply swap cards to try out different configurations (remember, always shut the Pi down before removing the card). You will also need to obtain an Ethernet cable, an HDMI to HDMI cable, a wired USB keyboard and mouse, and a TV that accepts HDMI input. We are also assuming you have a home router with Internet access that the Ethernet cable will plug into.

The initial configuration is done sitting in front of the plugged-in TV, but once ‘SSH’ (and potentially ‘TightVNC’) are installed, the Pi can be left plugged into the router and accessed remotely from a separate laptop, connected say wirelessly to the router. Access is either via a secure shell ‘SSH’ client such as ‘PuTTY’ (http://www.chiark.greenend.org.uk/~sgtatham/putty/), or the graphical ‘TightVNC’ client (http://www.tightvnc.com/). These client tools need to be installed on the separate computer/laptop first. Do I need TightVNC? TightVNC will be useful if you intend to run graphical programmes off the Pi. However, you don’t need it if you are configuring the Pi just as a server (e.g. web server or database server), in which case PuTTY will be fine.

New SDHC Card image

If you buy a pre-loaded Raspbian card, you don’t need this step, but if you did start with a blank SDHC card these are the steps you need. Take an empty 4GB SDHC card, formatted as FAT32 (format it first if it is not new). Note, unless your laptop is fairly new, it is best to use an external SDHC USB reader/writer device rather than the slot built into the laptop.

To format SDHC cards, first download the utility ‘SDformatter’ (https://www.sdcard.org/downloads/formatter_3/). Select ‘Options’ -> ‘format size adjustment on’. Use this tool, not the Windows format programme.

For advice on easy set up of the card, see http://elinux.org/RPi_Easy_SD_Card_Setup.

Next, download the latest Pi Operating System, called Raspbian ‘Wheezy’, from http://www.raspberrypi.org/downloads.

If you download the zipped file on Windows, you can unzip the contents to reveal the ‘.img’ image file.

Next, to write the image to the blank card, use the SDHC Image writer from https://launchpad.net/win32-image-writer (now at http://sourceforge.net/projects/win32diskimager/).

Other scary options that some people use include the ‘flashnul-1rc1’ or the ‘fedora-arm-installer-1.0.3-7-x32’ tools. The ‘flashnul-1rc1’ in particular needs especial care and attention (being in Russian, comrade!)

Once Raspbian is unpacked some files will be visible on the card in Windows – but don’t edit them in Windows! Some files are hidden to Windows so it doesn’t look like a full card. Then insert the card into the Pi and boot up. This should start up and kick off raspi-config. The default account/password is:

User: pi                 Password: raspberry

You should be running in terminal mode, type ‘startx’ to kick off the graphical user window environment (this is something a lot of guides seem to fail to mention!)

startx

Spend a little time now exploring the graphical interface. The menus and system options are available from a drop down menu in the top left corner. Note the very first time you run, you may be presented with the system configuration tool ‘raspi-config’, explained below.

The rest of the article assumes an Ethernet cable connected to the Internet is plugged in. Note there are wireless USB dongles that should work with the Pi ‘out of the box’, instructions for configuring that are elsewhere here on Geothread – see http://www.geothread.net/cookbook-configuring-wifi-on-raspberry-pi/.

Basic Configuration

You may find you are running raspi-config immediately; if not, this useful tool can be started at the command line with the command ‘sudo raspi-config’. Note many of the commands in this tutorial are preceded by ‘sudo’ – meaning run the command with root ‘superuser’ authority.

Once running raspi-config, you can select from a few useful options from its menu, thus:

expand_rootfs

This option ‘inflates’ the filesystem to fill the SD card. This makes the whole capacity of the SDHC card available for storage – a good idea!

ssh server – Enable

The ssh ‘secure shell’ allows programmes like ‘putty’ on a PC to run a terminal session onto the Pi. However to do this the ssh server needs to be running and enabled on the Pi.

update

Also select the option ‘update’ to update the raspi-config itself.

Once you ‘Finish’, you will pass back to the command line.

Next, you must set the time zone and update the core operating system files. You enter:

sudo dpkg-reconfigure tzdata 
sudo apt-get update 
sudo apt-get upgrade 
sudo apt-get dist-upgrade

This updates the system software – if the Wheezy image was up to date this shouldn’t take too long. If it wasn’t, it is time to make a cup of tea! Open source software gets updated FAR more frequently than proprietary software, so you will need to run the last three commands from time to time to keep the Pi up to date.

To be able to access the Pi remotely, install the TightVNC server if required. This gives a graphical ‘X-Windows’ GUI onto the Raspberry Pi from a remote laptop/PC/Mac. Note you also need to install the TightVNC client on the separate laptop/computer that will be used to access the Pi.

sudo apt-get install tightvncserver

Following the excellent instructions at http://www.neil-black.co.uk/raspberry-pi-beginners-guide#.UVWzqBfIbJY, or the more recent update at http://www.neil-black.co.uk/the-updated-raspberry-pi-beginners-guide#.VFPzeL5urww, next you need to configure the tightvncserver. Create a new file called ‘tightvncserver’ in the init.d directory (nano is a text editor programme):

sudo nano /etc/init.d/tightvncserver

Into this file, enter the following:

#!/bin/sh 
# /etc/init.d/tightvncserver 
VNCUSER='pi' 
case "$1" in 
    start) 
        su $VNCUSER -c '/usr/bin/tightvncserver :1' 
        echo "Starting TightVNC Server for $VNCUSER " 
        ;; 
    stop) 
        pkill Xtightvnc 
        echo "TightVNC Server stopped" 
        ;; 
    *) 
        echo "Usage: /etc/init.d/tightvncserver {start|stop}" 
        exit 1 
        ;; 
esac 
exit 0

Give the script executable permission:

sudo chmod 755 /etc/init.d/tightvncserver

Now start or stop the service manually thus (this is the basic model used for dealing with all Debian processes).

sudo /etc/init.d/tightvncserver start
sudo /etc/init.d/tightvncserver stop

Make the TightVNC server start every time the Raspberry Pi starts.

sudo update-rc.d tightvncserver defaults 99

Setting an IP Address

By default the Pi uses Dynamic Host Configuration Protocol (DHCP), meaning it may get a new ‘IP’ address each time it boots up. If we are using the Pi as a web server/database server this could be a nuisance. If so, we can force the Pi to use a ‘fixed IP’ address so it always remains the same.

Following the excellent instructions at http://www.neil-black.co.uk/raspberry-pi-beginners-guide#.UVWzqBfIbJY, set the IP address from DHCP to fixed. This will make it easier to locate your webserver later.

First, identify the current computer network ‘IP’ address, type:

ifconfig

On a home network, it is likely to be something like 192.168.1.xx. We will fix the Pi as ‘192.168.1.100’. Edit the network interfaces configuration file with the text editor ‘nano’, thus:

sudo nano /etc/network/interfaces

Replace the lines

iface lo inet loopback 
iface eth0 inet dhcp

With

iface lo inet loopback 
iface eth0 inet static 
address 192.168.1.100 
netmask 255.255.255.0 
gateway 192.168.1.1

Note we used 192.168.1.100  (this being a private network subnet address)
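
Before rebooting, it is worth checking that the chosen address and the gateway really sit in the same subnet, since a mismatch would leave the Pi unable to reach the router. The sketch below uses Python’s standard ipaddress module; the function name `gateway_reachable` is our own.

```python
import ipaddress


def gateway_reachable(address, netmask, gateway):
    """Return True if `gateway` lies in the subnet defined by address/netmask."""
    # ip_interface accepts the dotted netmask form, e.g. "192.168.1.100/255.255.255.0".
    interface = ipaddress.ip_interface(f"{address}/{netmask}")
    return ipaddress.ip_address(gateway) in interface.network
```

With the values above, `gateway_reachable("192.168.1.100", "255.255.255.0", "192.168.1.1")` returns True; substitute the figures reported by ifconfig on your own network.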

See also http://my-music.mine.nu/images/rpi_raspianwheezy_setup.pdf for more guidance.

Now reboot the Pi. To do this, type in:

sudo shutdown -r now

This reboots the Pi ‘now’ and then hopefully you can log on again, but this time potentially as a remote session with SSH or TightVNC from your laptop (where the laptop has a wireless connection to the router). Note the IP address is now fixed as noted above. If you have a laptop/PC on the same home network, you can open a command prompt and run ‘ping 192.168.1.100’ to see if you receive a response from the Pi.

If you use TightVNC rather than ssh, then in the VNC client installed on your laptop/pc you need to enter the address of the Pi for the connection, e.g. 192.168.1.100:1. Don’t omit the ‘:1’ at the end of the address!! You can also explore the options to set the screen size ‘geometry’ to 1024×768 with a colour depth of 24 bits.

Linux Commands

You need to be familiar with Linux commands. Learning Linux can be daunting – some good documentation is here http://www.debian.org/doc/#manuals – best is the one page reference card here http://www.debian.org/doc/#other.

In the next steps, now the basics are in place, we proceed to the installations of the web and database server software. See also http://www.wikihow.com/Make-a-Raspberry-Pi-Web-Server

Apache Web Server and PHP

To install the Apache web server and PHP (find out more here http://www.apache.org/ and here http://php.net/), type:

sudo apt-get install apache2 php5 libapache2-mod-php5

To ‘restart’, ‘stop’ or ‘start’ apache, use the following command (varying the last word)

sudo service apache2 restart

Note the Apache webroot location for your webfiles is ‘/var/www’.

The Apache log files are put in ‘/var/log/apache2’.

MySQL Database Server

sudo apt-get install mysql-server mysql-client php5-mysql

During installation you need to enter a database system (MySQL ‘root’) password, such as ‘raspberry’! Once installed, restart the service and you can hopefully access the MySQL command line prompt thus:

sudo service mysql restart 
mysql -u pi -p

Check the interactive logon works as user pi. Try and ‘use’ a database and select records:

use mysql 
select * from user;

Use <ctrl>+c to quit

For logging, see http://serverfault.com/questions/71071/how-to-enable-mysql-logging

PHP

PHP should have been installed also. In the /var/www webroot, create a new file called testphp.php with the ‘nano’ editor, and enter the following. Once created, open the file in your laptop/pc browser to see the Pi PHP status:

sudo nano /var/www/testphp.php

Enter the text oneliner below into the file.

<?php phpinfo(); ?>

Your laptop/pc web browser ‘should’ be able to see this web page served up. Enter the IP address we set earlier, plus this file, thus:

In web browser, enter the URL http://192.168.1.100/testphp.php

Hopefully you should see the PHP status page appear.
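
The same check can be scripted from the laptop/pc, which is handy when testing the server after later changes. This is a minimal sketch using Python’s standard urllib; the helper name `http_status` is our own.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if no server answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return error.code
    except (urllib.error.URLError, OSError):
        # Nothing reachable at that address (wrong IP, Pi switched off, etc.).
        return None
```

A result of 200 from `http_status("http://192.168.1.100/testphp.php")` would suggest Apache is serving the page; None suggests the Pi is not reachable at that address.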

Mysqli

In PHP you should use the recommended MySQLi command extension to access the MySQL database. It ‘should’ be installed already – but look down the phpinfo() web page above to find it and ensure it is shown as ‘enabled’.

To learn more, see http://codular.com/php-mysqli         (but do check the phpinfo() output above first)

Phpmyadmin

To manage the MySQL database, we recommend the excellent ‘phpmyadmin’ MySQL web management console. All the MySQL administration can then be undertaken through a web page.

See http://www.dingleberrypi.com/2012/09/tutorial-install-phpmyadmin-on-your-raspberry-pi/

sudo apt-get install phpmyadmin

Select ‘apache2’ as the web server.

Create a phpmyadmin password, such as ‘raspberry’ (or whatever – but don’t forget it!)

You should now be able to call up the phpmyadmin home page at:

http://192.168.1.100/phpmyadmin – login as root/raspberry

Note user privileges to the MySQL database are all managed via phpmyadmin. Each user account needs to have specified the host from which the user will come to access the database. The ‘%’ (all servers) option didn’t work for us and so we needed to repeat a configuration for servers: ‘127.0.0.1’, ‘192.168.1.100’, ‘localhost’ and ‘raspberry pi’. A bit fiddly – essentially you open up the privileges for the user and duplicate the settings, varying the host each time. To help with this, note that once phpmyadmin is installed, one can see the various options the phpmyadmin user has pre-assigned. Note that connecting to MySQL from the SSH terminal uses ‘localhost’, whereas connecting from a webserver can use ‘127.0.0.1’ – so the host to grant depends on where the connection comes from. One needs to experiment – hmm!
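
We did this through the phpmyadmin pages, but the repetition can be pictured as one SQL GRANT statement per host. The sketch below only builds the statement text; it does not connect to MySQL. The function name `grant_statements` and the database name ‘mydb’ in the usage note are hypothetical, and the inputs are assumed to be trusted values typed by the administrator.

```python
def grant_statements(user, database, hosts):
    """Build one GRANT statement per host for the given MySQL user and database."""
    # No escaping is attempted: the names are assumed to be trusted, simple values.
    return [
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO '{user}'@'{host}';"
        for host in hosts
    ]
```

For example, `grant_statements("pi", "mydb", ["127.0.0.1", "192.168.1.100", "localhost"])` yields three statements that differ only in the host part, which is exactly the duplication described above.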

Remember later that if you ever receive a ‘500’ error on the webserver when trying to access the database, it is likely due to the incorrect privileges. You may need to look at the webserver apache logfiles in a new window (‘/var/log/apache2’) to actually see the error:

sudo more /var/log/apache2/error.log

or better still, to see only new lines as they appear in the logfile:

sudo tail -f /var/log/apache2/error.log

VSFTP

We need the means to get the web pages we write off the laptop/pc and onto the Pi webserver. For this, we need an ftp daemon running to upload files onto the Pi – we suggest using vsftpd, it is the best (but not the only) ftp server for Debian.

sudo apt-get install vsftpd

Based on the instructions at http://www.instructables.com/id/Raspberry-Pi-Web-Server/step9/Install-an-FTP-server/, now edit the vsftp configuration file.

sudo nano /etc/vsftpd.conf

Search down through the file and change the following lines:
anonymous_enable=YES Change To anonymous_enable=NO
#local_enable=YES Change To local_enable=YES
#write_enable=YES Change To write_enable=YES

Also, add a line to the bottom of the file:
force_dot_files=YES
Quit the editor and restart the vsftp server.

sudo service vsftpd restart
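
If you rebuild cards often, the same edits can be applied by a small script rather than by hand. The sketch below works on the text of the configuration file and returns the edited text; the function name `apply_settings` is our own, and it assumes each option appears at most once in the file (commented out or not).

```python
import re

# The three changes and the one addition described above.
WANTED = {
    "anonymous_enable": "NO",
    "local_enable": "YES",
    "write_enable": "YES",
    "force_dot_files": "YES",
}


def apply_settings(config_text, wanted=WANTED):
    """Return config_text with each wanted option set, uncommenting it if needed."""
    remaining = dict(wanted)
    output = []
    for line in config_text.splitlines():
        # Matches "key=value" and "#key=value", but not ordinary comment sentences.
        match = re.match(r"^#?\s*(\w+)=", line)
        key = match.group(1) if match else None
        if key in remaining:
            output.append(f"{key}={remaining.pop(key)}")
        else:
            output.append(line)
    # Options not found anywhere in the file are added at the bottom.
    output.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(output) + "\n"
```

It is safest to write the result to a new file and compare it with /etc/vsftpd.conf before replacing the original.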

We suggest NOT following the additional post-install steps in some guides for editing the pi account line in the passwd file etc. We trashed an installation doing this. You ‘should’ be able to just sudo apt-get the vsftpd server and reboot, as we did second time round, then make the minor edits to the vsftpd configuration file described above and nothing else.

On your laptop/pc, you need an ‘ftp’ client. We suggest the freely available ‘filezilla’ ftp client (https://filezilla-project.org) for transferring files across by ftp. Connect to ftp://192.168.1.100 and log in to the Pi as user pi. Save the connection settings to facilitate future access.

Web files

Set webroot file privileges

sudo chown -R pi /var/www

Now a set of website files can be copied from the PC/laptop with filezilla to this folder and accessed from the laptop/pc webbrowser thus: http://192.168.1.100

Disk capacity free

Here’s how you can establish how much disk space is left over. After all of the above we had used about 1.8Gb on the 4Gb SDHC card.

df -h
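
The same figures are available from Python, which can be useful inside a monitoring script. This is a minimal sketch using the standard shutil module; the function name `disk_summary` is our own. It reports decimal gigabytes (10**9 bytes), so the numbers may differ slightly from those printed by df.

```python
import shutil


def disk_summary(path="/"):
    """Return (total, used, free) for the filesystem holding `path`, in GB."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return tuple(round(value / 1e9, 1) for value in usage)
```

On the Pi, `disk_summary("/")` reports on the SDHC card’s root filesystem.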

Shutting Raspberry Pi down

sudo shutdown -h now

To reboot/restart it when running

sudo shutdown -r now

Epilogue

The Raspberry Pi is an excellent training tool for LAMP installations and can even be serviceable for light operational tasks. It is an educational tool though and a really great way to learn Linux, Apache, PHP, MySQL etc… One great advantage is the SDHC card filesystem. You can keep several installations and configurations on different cards and, as long as the machine is correctly powered down, swap cards and reboot as required.

Backing up these SDHC cards is not easy and not always predictable. Tools like the win32diskimager and flashnul-1rc1 can image a card to a ‘.img’ file, but the file may a. not fit back onto a formatted card of same capacity, and b. not work anyway after reboot. Googling shows there are a lot of approaches taken – which is best though isn’t clear! Have fun!