Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 Setup


This guide sets up a non-clustered Nutch crawler, which stores its data via HBase. We will not learn how to set up Hadoop et al., but just the bare minimum to crawl and index websites on a single machine.


  • Nutch – the crawler (fetches and parses websites)
  • HBase – filesystem storage for Nutch (Hadoop component, basically)
  • Gora – filesystem abstraction, used by Nutch (HBase is one of the possible implementations)
  • ElasticSearch – index/search engine, searching on data created by Nutch (does not use HBase, but its own data structure and storage)


Install OpenJDK, ant and ElasticSearch via your repository manager of choice (ES can be installed by using the .deb linked above, if you need).

Extract Nutch and HBase somewhere. From now on, we will refer to the Nutch root directory by $NUTCH_ROOT and the HBase root by $HBASE_ROOT.

Setting up HBase

  1. edit $HBASE_ROOT/conf/hbase-site.xml and add
  2. edit $HBASE_ROOT/conf/ and enable JAVA_HOME and set it to the proper path:
    -# export JAVA_HOME=/usr/java/jdk1.6.0/
    +export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

    This step might seem redundant, but even with JAVA_HOME being set in my shell, HBase just didn’t recognize it.

  3. kick off HBase:

Setting up Nutch

  1. enable the HBase dependency in $NUTCH_ROOT/ivy/ivy.xml by uncommenting the line
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
  2. configure the HBase adapter by editing the $NUTCH_ROOT/conf/
  3. build Nutch
    $ cd $NUTCH_ROOT
    $ ant clean
    $ ant runtime

    This can take a while and creates $NUTCH_ROOT/runtime/local.

  4. configure Nutch by editing $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:
  5. configure HBase integration by editing $NUTCH_ROOT/runtime/local/conf/hbase-site.xml:

That’s it. Everything is now set up to crawl websites.

Adding new Domains to crawl with Nutch

  1. create an empty directory. Add a text file containing a list of seed URLs, one per line.
    $ mkdir seed
    $ echo "" >> seed/urls.txt
    $ echo "" >> seed/urls.txt
    $ echo "" >> seed/urls.txt
  2. inject them into Nutch by giving a file URL (!)
    $ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/

Actual Crawling Procedure

  1. Generate a new set of URLs to fetch. This is based on both the injected URLs and outdated URLs in the Nutch crawl db.
    $ $NUTCH_ROOT/runtime/local/bin/nutch generate -topN 10

    The above command will create job batches for 10 URLs.

  2. Fetch the URLs. We are not clustering, so we can simply fetch all batches:
    $ $NUTCH_ROOT/runtime/local/bin/nutch fetch -all
  3. Now we parse all fetched pages:
    $ $NUTCH_ROOT/runtime/local/bin/nutch parse -all
  4. Last step: Update Nutch’s internal database:
    $ $NUTCH_ROOT/runtime/local/bin/nutch updatedb -all

On the first run, this will only crawl the injected URLs. The procedure above is supposed to be repeated regularly to keep the index up to date.
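
Since the four steps always run in the same order, they can be wrapped in a small shell function, so that one call performs a complete crawl round. This is only a sketch: the function name crawl_round and the NUTCH variable are made up for illustration, and it assumes $NUTCH_ROOT points at the Nutch root directory as described earlier.

```shell
# Path to the nutch launcher. NUTCH is a hypothetical variable name;
# it falls back to the current directory if NUTCH_ROOT is not set.
NUTCH="${NUTCH:-${NUTCH_ROOT:-.}/runtime/local/bin/nutch}"

# Run one generate/fetch/parse/updatedb round.
# $1 is the -topN value (defaults to 10, as in the steps above).
# The chain stops at the first step that fails.
crawl_round() {
    "$NUTCH" generate -topN "${1:-10}" &&
    "$NUTCH" fetch -all &&
    "$NUTCH" parse -all &&
    "$NUTCH" updatedb -all
}
```

Calling crawl_round 10 then does the same as typing the four commands by hand, which makes it easier to repeat the round from a script or a cron job.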

Putting Documents into ElasticSearch

Easy peasy:

$ $NUTCH_ROOT/runtime/local/bin/nutch index -all

Query for Documents

The usual ElasticSearch way:

$ curl -X GET "http://localhost:9200/_search?q=my%20term"
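
Typing %20 by hand gets tedious, so here is a small sketch of a helper that builds such a search URL from a plain search term. It uses the q parameter of ElasticSearch's URI search. The names es_search_url and ES_URL are hypothetical, and the helper only encodes spaces, so terms with other special characters need a proper URL encoder.

```shell
# Base URL of the ElasticSearch node. ES_URL is a hypothetical variable name;
# the default is the host and port used in this guide.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print a URI-search URL for the free-text term given as $1.
es_search_url() {
    # Only spaces are percent-encoded; all other characters are passed through.
    term="$(printf '%s' "$1" | sed 's/ /%20/g')"
    printf '%s/_search?q=%s\n' "$ES_URL" "$term"
}
```

Used together with curl, this looks like: curl -X GET "$(es_search_url "my term")"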

How to install memcache on windows

Check whether your operating system is 32-bit or 64-bit. Based on that, download the matching binary version.

  1. 64bit Os –

  2. 32bit Os –

  • Place the binary file on the C: or D: drive, for example in C:\memcached
  • Run the Command Prompt as Administrator and install the service: C:\memcached\memcached.exe -d install
  • Once installed, start the service: C:\memcached\memcached.exe -d start
  • Verify that the service is running via Start -> Run -> services.msc
  • Check your PHP extensions directory for php_memcache.dll
  • If you don’t have it, download it
  • Edit the php.ini file and add the following line to the extensions section:
    extension=php_memcache.dll
  • Restart your Apache server and we are good to go.

If you are using Drupal, add the lines below to settings.php:

$conf['cache_backends'][] = 'sites/all/modules/contrib/memcache/';
$conf['cache_default_class'] = 'MemCacheDrupal';
$conf['cache_class_cache_form'] = 'DrupalDatabaseCache';

You can check the memcache settings at “admin/reports/memcache”.


If you are using core PHP, use the code below to check whether memcache is working.

// Connect to the local memcached service on its default port
$memcache = new Memcache;
$memcache->connect('localhost', 11211) or die ("Could not connect");
$version = $memcache->getVersion();
echo "Server's version: ".$version."<br/>\n";
// Store a test object that expires after 10 seconds
$tmp_object = new stdClass;
$tmp_object->str_attr = 'test';
$tmp_object->int_attr = 123;
$memcache->set('key', $tmp_object, false, 10) or die ("Failed to save data at the server");
echo "Store data in the cache (data will expire in 10 seconds)<br/>\n";
// Read the object back and print it
$get_result = $memcache->get('key');
echo "Data from the cache:<br/>\n";
var_dump($get_result);

Install SASS on Windows

  • The fastest way to get Ruby on your Windows computer is to use Ruby Installer.
  • After installing Ruby, open a command prompt, go to the C:\Ruby200-x64\bin folder and install SASS with the following command:
  • gem install sass
  • Check the SASS version with the following command:
  • sass -v
  • Install Compass with the following command:
  • gem install compass
  • After installing Compass, add the Ruby bin path (i.e. C:\Ruby200-x64\bin) to the Path variable of your system environment variables (My Computer => Properties => Advanced system settings => Advanced => Environment Variables => Path)
  • Restart the system.
  • In the command prompt, go to your project folder, e.g. C:\xampp\htdocs\example, and run:
  • compass watch
  • Change a stylesheet in the project and see the compile information in the command prompt

How to execute PHP code on existing html page?

The way to execute PHP on a .html page is to modify your .htaccess file. This file may be hidden, so depending upon your FTP program you may have to modify some settings to see it. Then you just need to add this line for .html:

AddType application/x-httpd-php .html

If you only plan on including PHP on one page, it is better to set it up this way:
<Files abc.html>
AddType application/x-httpd-php .html
</Files>
This code will only make PHP executable in the abc.html file, and not in all of your HTML pages.

ApacheSolr Configuration in Ubuntu for Drupal

NOTE: For installing Apache Solr with Drupal on a Windows machine, please use the following link for the setup instead of the instructions below. Also see the comments on that link if you face any issues.

Installing Tomcat
sudo apt-get install tomcat7 tomcat7-admin tomcat7-common tomcat7-user tomcat7-docs tomcat7-examples

Start tomcat by typing
  sudo /etc/init.d/tomcat7 start

Security (Not required if installing on same machine)
If you are using iptables and installing Apache Solr on an external server,
modify or add the following line to accept traffic on port 8080:
-A INPUT -p tcp -m tcp --dport 8080 -j ACCEPT

After installation, type http://localhost:8080 or http://serverip:8080 in your browser. Now you should see the Tomcat welcome page.

Install Solr
(Check for latest version or nightly build on or

Unpack the archive:
       tar -zxvf apache-solr-1.4.1.tgz

Linking tomcat7 with Apache Solr application

First check where your distribution installed tomcat7:

whereis tomcat7

This should show you tomcat7: /etc/tomcat7 /usr/share/tomcat7
Attention: if your path is different, do not forget to adjust it in the next steps as well.

Create the webapps directory:

sudo mkdir /usr/share/tomcat7/webapps

Copy the war file to the webapps directory:

sudo cp apache-solr-1.4.1/dist/apache-solr-1.4.1.war /usr/share/tomcat7/webapps/solr.war

Copy the example Solr application to a new directory called solr. We will change this example Solr application later on to make it suitable for Drupal 6.

sudo cp -R apache-solr-1.4.1/example/solr/ /usr/share/tomcat7/solr/
Create our config file:
 sudo nano /etc/tomcat7/Catalina/localhost/solr.xml
And fill it with the following configuration:
 <Context docBase="/usr/share/tomcat7/webapps/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/usr/share/tomcat7/solr" override="true" />
 </Context>
Managing tomcat7 application
 To see whether our Solr application is running, we can use the manager application. By default you don't have access to this application, so we have to modify the permissions.
     sudo nano /etc/tomcat7/tomcat-users.xml

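And modify it so it more or less reflects the information shown here. On Tomcat 7 the HTML manager is usually guarded by the manager-gui role, so add that role as well if the login is rejected.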

  <role rolename="admin"/>
  <role rolename="manager"/>
  <user username="nick" password="ateneatech" roles="admin,manager"/>
Drop Tomcat security so Solr can access /usr/share/tomcat7/solr
          sudo nano /etc/default/tomcat7
And modify it so that security is disabled. Be careful if you are running on a server which you do not control 100%!
Restart the Tomcat service:
       sudo /etc/init.d/tomcat7 restart

Surf to http://localhost:8080/manager/, log in with your username and password from above and check whether the Solr instance is started. If not, start it and check whether or not you receive an error code!
If your application is started, surf to http://localhost:8080/solr/admin and you should see a nice screen!

Linking Drupal 6 with a running Apache Solr
Perform this step if you do not have apache-solr module already enabled i.e. you are adding apache-solr to your app for the first time:

I assume you have Drush installed, so we continue with downloading the apachesolr module. Execute this command in the designated website.
drush dl apachesolr

Perform following steps for all installations of apache solr:
 Let's copy the schema and config files that customize our Apache Solr instance so it fits the "Drupal" bill.
 sudo cp apachesolr/schema.xml /usr/share/tomcat7/solr/conf/schema.xml

 sudo cp apachesolr/solrconfig.xml /usr/share/tomcat7/solr/conf/solrconfig.xml

Tip: it might be a good idea to use symbolic links instead of copies, so that updating the module also updates the schema if it changes …you never know with open source ;-)
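
A possible way to set up such symbolic links is sketched below. The helper name link_solr_conf is made up for illustration; it takes the apachesolr module directory and the Solr conf directory as arguments and replaces any copied files with links. It assumes the user running Tomcat can read the module directory.

```shell
# Symlink the Solr config files shipped with the apachesolr module into
# the Solr conf directory. link_solr_conf is a hypothetical helper name.
#   $1  apachesolr module directory (contains schema.xml and solrconfig.xml)
#   $2  Solr conf directory, e.g. /usr/share/tomcat7/solr/conf
link_solr_conf() {
    # Resolve to an absolute path so the links stay valid from any directory.
    src="$(cd "$1" && pwd)" || return 1
    for f in schema.xml solrconfig.xml; do
        # -f replaces a previously copied file or an old link.
        ln -sf "$src/$f" "$2/$f" || return 1
    done
}
```

Run it from a shell that may write to the conf directory (a root shell, for instance), e.g. link_solr_conf apachesolr /usr/share/tomcat7/solr/conf, and restart Tomcat afterwards.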

Additionally, give the folder the right permissions:
 sudo chown -R tomcat7:root /usr/share/tomcat7/solr/

Enable the module in the modules list, go to the config screen and fill in the following parameters:

Host name of your Solr server, e.g. localhost or IP Address or
Solr host name: localhost

Port on which the Solr server listens. Tomcat is 8080 by default.
Solr port: 8080

Path that identifies the Solr request handler to be used.
Solr path: solr

On saving these settings, the message “Your site has contacted Apache Solr” will be displayed.
You can now start indexing the existing content on your site using cron and check the amount of indexing done at “admin/settings/apachesolr/index”.

PHP: Stopping E-mail Injections

The best way to stop e-mail injections is to validate the input.


function spamcheck($field)
{
  //filter_var() sanitizes the e-mail
  $field=filter_var($field, FILTER_SANITIZE_EMAIL);
  //filter_var() validates the e-mail
  if(filter_var($field, FILTER_VALIDATE_EMAIL))
  {
    return TRUE;
  }
  else
  {
    return FALSE;
  }
}

if (isset($_REQUEST['email']))
{//if "email" is filled out, proceed

  //check if the email address is invalid
  $mailcheck = spamcheck($_REQUEST['email']);
  if ($mailcheck==FALSE)
  {
    echo "Invalid input";
  }
  else
  {//send email
    //use the sanitized address in the header, not the raw input
    $email = filter_var($_REQUEST['email'], FILTER_SANITIZE_EMAIL);
    $subject = $_REQUEST['subject'] ;
    $message = $_REQUEST['message'] ;
    mail("", "Subject: $subject",
    $message, "From: $email" );
    echo "Thank you for using our mail form";
  }
}
else
{//if "email" is not filled out, display the form
  echo "<form method='post' action='mailform.php'>
  Email: <input name='email' type='text' /><br />
  Subject: <input name='subject' type='text' /><br />
  Message:<br />
  <textarea name='message' rows='15' cols='40'>
  </textarea><br />
  <input type='submit' />
  </form>";
}

In the code above we use PHP filters to validate input:

  • The FILTER_SANITIZE_EMAIL filter removes all illegal e-mail characters from a string
  • The FILTER_VALIDATE_EMAIL filter validates value as an e-mail address