Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 Setup

Info

This guide sets up a non-clustered Nutch crawler, which stores its data via HBase. We will not learn how to set up Hadoop et al., but just the bare minimum to crawl and index websites on a single machine.

Terms

  • Nutch – the crawler (fetches and parses websites)
  • HBase – storage backend for Nutch (a database from the Hadoop ecosystem)
  • Gora – storage abstraction, used by Nutch (HBase is one of the possible implementations)
  • ElasticSearch – index/search engine, searching on data created by Nutch (does not use HBase, but its own data structure and storage)

Requirements

Install OpenJDK, ant and ElasticSearch via your package manager of choice (ES can also be installed from its .deb package, if you prefer).

Extract Nutch and HBase somewhere. From now on, we will refer to the Nutch root directory by $NUTCH_ROOT and the HBase root by $HBASE_ROOT.

Setting up HBase

  1. edit $HBASE_ROOT/conf/hbase-site.xml and add
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///full/path/to/where/the/data/should/be/stored</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>false</value>
      </property>
    </configuration>
  2. edit $HBASE_ROOT/conf/hbase-env.sh and enable JAVA_HOME and set it to the proper path:
    -# export JAVA_HOME=/usr/java/jdk1.6.0/
    +export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

    This step might seem redundant, but even with JAVA_HOME being set in my shell, HBase just didn’t recognize it.

  3. kick off HBase:
    $HBASE_ROOT/bin/start-hbase.sh

Setting up Nutch

  1. enable the HBase dependency in $NUTCH_ROOT/ivy/ivy.xml by uncommenting the line
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
  2. configure the HBase adapter by editing the $NUTCH_ROOT/conf/gora.properties:
    -#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
    +gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  3. build Nutch
    $ cd $NUTCH_ROOT
    $ ant clean
    $ ant runtime

    This can take a while and creates $NUTCH_ROOT/runtime/local.

  4. configure Nutch by editing $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>mycrawlername</value>
      </property>
      <property>
        <name>http.robots.agents</name>
        <value>mycrawlername</value>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
      </property>
      <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
      </property>
      <property>
        <name>elastic.host</name>
        <value>localhost</value>
      </property>
    </configuration>
  5. configure HBase integration by editing $NUTCH_ROOT/runtime/local/conf/hbase-site.xml:
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///full/path/to/where/the/data/should/be/stored</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>false</value>
      </property>
    </configuration>

That’s it. Everything is now set up to crawl websites.
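
Before crawling, it can save some head-scratching to confirm that the properties from the Nutch configuration step actually made it into the built runtime config. A minimal sketch, assuming a POSIX shell; the function name check_nutch_conf is made up for illustration, and it only greps for the property names rather than validating the XML:

```shell
# Hypothetical helper: report which of the properties used in this guide
# are present in a Hadoop-style XML config file.
# Returns 0 if all are found, 1 otherwise.
check_nutch_conf() {
  conf="$1"
  missing=0
  for prop in http.agent.name storage.data.store.class plugin.includes; do
    # -F: treat the pattern as a fixed string, so the dots are literal
    if grep -qF "<name>$prop</name>" "$conf"; then
      echo "ok: $prop"
    else
      echo "MISSING: $prop"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_nutch_conf "$NUTCH_ROOT/runtime/local/conf/nutch-site.xml"
```

The check for http.agent.name matters most: Nutch refuses to fetch when no agent name is configured.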

Adding new Domains to crawl with Nutch

  1. create an empty directory. Add a text file containing a list of seed URLs.
    $ mkdir seed
    $ echo "https://www.website.com" >> seed/urls.txt
    $ echo "https://www.another.com" >> seed/urls.txt
    $ echo "https://www.example.com" >> seed/urls.txt
  2. inject them into Nutch by giving a file URL (!)
    $ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/

Actual Crawling Procedure

  1. Generate a new set of URLs to fetch. This is based on both the injected URLs as well as outdated URLs in the Nutch crawl db.
    $ $NUTCH_ROOT/runtime/local/bin/nutch generate -topN 10

    The above command will create a fetch batch of at most 10 URLs.

  2. Fetch the URLs. We are not clustering, so we can simply fetch all batches:
    $ $NUTCH_ROOT/runtime/local/bin/nutch fetch -all
  3. Now we parse all fetched pages:
    $ $NUTCH_ROOT/runtime/local/bin/nutch parse -all
  4. Last step: Update Nutch’s internal database:
    $ $NUTCH_ROOT/runtime/local/bin/nutch updatedb -all

On the first run, this will only crawl the injected URLs. The procedure above is supposed to be repeated regularly to keep the index up to date.
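
Since the four steps are always run in the same order, they lend themselves to a small wrapper. A minimal sketch, assuming a POSIX shell; the function name crawl_rounds and the NUTCH override variable are made up for illustration, and stopping at the first failing step is a choice of this sketch, not something Nutch requires:

```shell
# Hypothetical helper: run the generate/fetch/parse/updatedb cycle
# a given number of times.
#   $1 = number of rounds, $2 = value passed to -topN
# NUTCH may be set to override the path to the nutch launcher.
crawl_rounds() {
  rounds="$1"
  top_n="$2"
  nutch="${NUTCH:-$NUTCH_ROOT/runtime/local/bin/nutch}"
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "crawl round $i of $rounds" >&2
    # Abort the whole run as soon as one step fails
    "$nutch" generate -topN "$top_n" || return 1
    "$nutch" fetch -all || return 1
    "$nutch" parse -all || return 1
    "$nutch" updatedb -all || return 1
    i=$((i + 1))
  done
}

# Usage: three rounds with up to 10 URLs each
#   crawl_rounds 3 10
```

Each additional round picks up the links discovered by the previous one, so more rounds mean a deeper crawl.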

Putting Documents into ElasticSearch

Easy peasy:

$ $NUTCH_ROOT/runtime/local/bin/nutch index -all

Query for Documents

The usual ElasticSearch way:

$ curl -X GET "http://localhost:9200/_search?q=my%20term"
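
For anything beyond a quick lookup, the same search can be sent as a JSON request body. A minimal sketch, assuming a POSIX shell; the helper name es_match_body is made up for illustration, and the field name content is an assumption about what the index-basic plugin writes, so check your index mapping first:

```shell
# Hypothetical helper: build a match query body for a field and a search term.
# The term is inserted verbatim, so it must not contain double quotes
# or backslashes.
es_match_body() {
  printf '{"query":{"match":{"%s":"%s"}}}' "$1" "$2"
}

# Usage (needs ElasticSearch running on localhost:9200):
#   curl -X GET "http://localhost:9200/_search?pretty" \
#        -d "$(es_match_body content 'my term')"
```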

Install SASS on Windows

  • The fastest way to get Ruby on your Windows computer is to use Ruby Installer.
  • After installing Ruby, open a command prompt, go to the C:\Ruby200-x64\bin folder and install SASS with the following command:
  • gem install sass
  • Check the SASS version with the following command:
  • sass -v
  • Install Compass with the following command:
  • gem install compass
  • After installing Compass, add the Ruby bin path (i.e. C:\Ruby200-x64\bin) to the Path system environment variable (My Computer => Properties => Advanced system settings => Advanced => Environment Variables => Path).
  • Restart the system.
  • In a command prompt, go to your project folder, e.g. C:\xampp\htdocs\example, and run:
  • compass watch
  • Change a stylesheet source file and watch the output in the command prompt.

Using Twig library with Codeigniter

A few days back I worked with Twig, the template engine for PHP, and tried whether it could be used in my existing CodeIgniter setup.

I want to share the following quick steps, which may be useful for using it with CodeIgniter.

Step 1

Create a Twig cache directory (application/cache/twig) under the “application/cache” folder and make sure it's writable.

Step 2

Download the Twig library and put it in the libraries folder so that it has the following directory structure:

application/libraries/Twig
|-- Error
|-- Extension
|-- Filter
|-- Function
|-- Loader
|-- Node
|   `-- Expression
|       |-- Binary
|       `-- Unary
|-- NodeVisitor
|-- Sandbox
|-- Test
`-- TokenParser

Now create Library files

File 1=> application/libraries/Twig.php

################################################

<?php if (!defined('BASEPATH')) {exit('No direct script access allowed');}
class Twig
{
    private $CI;
    private $_twig;
    private $_template_dir;
    private $_cache_dir;

    /**
     * Constructor: loads config/twig.php and sets up the Twig environment.
     */
    function __construct($debug = false)
    {
        $this->CI =& get_instance();
        $this->CI->config->load('twig');
        // Make the bundled Twig sources reachable for require_once
        ini_set('include_path',
            ini_get('include_path') . PATH_SEPARATOR . APPPATH . 'libraries/Twig');
        require_once (string) "Autoloader" . EXT;
        log_message('debug', "Twig Autoloader Loaded");
        Twig_Autoloader::register();
        $this->_template_dir = $this->CI->config->item('template_dir');
        $this->_cache_dir = $this->CI->config->item('cache_dir');
        $loader = new Twig_Loader_Filesystem($this->_template_dir);
        $this->_twig = new Twig_Environment($loader, array(
            'cache' => $this->_cache_dir,
            'debug' => $debug,
        ));
        // Expose every defined PHP function (internal and user) to templates
        foreach (get_defined_functions() as $functions) {
            foreach ($functions as $function) {
                $this->_twig->addFunction($function, new Twig_Function_Function($function));
            }
        }
    }

    // Register a single PHP function for use in templates
    public function add_function($name)
    {
        $this->_twig->addFunction($name, new Twig_Function_Function($name));
    }

    // Render a template and return the result as a string
    public function render($template, $data = array())
    {
        $template = $this->_twig->loadTemplate($template);
        return $template->render($data);
    }

    // Render a template and send it straight to the output
    public function display($template, $data = array())
    {
        $template = $this->_twig->loadTemplate($template);
        /* elapsed_time and memory_usage */
        $data['elapsed_time'] = $this->CI->benchmark->elapsed_time('total_execution_time_start', 'total_execution_time_end');
        $memory = (!function_exists('memory_get_usage')) ? '0' : round(memory_get_usage()/1024/1024, 2) . 'MB';
        $data['memory_usage'] = $memory;
        $template->display($data);
    }
}
###########################################

File 2=> application/config/twig.php

##############################################
<?php if (!defined('BASEPATH')) exit('No direct script access allowed');
$config['template_dir'] = APPPATH.'views';
$config['cache_dir'] = APPPATH.'cache/twig';
##############################################

Step 3 USAGE

Put this in your controller:
$this->load->library('twig');

$data['title'] = "twig loaded";

$this->twig->display('view.html', $data);


Simple, huh?

 
