Jobeet - Day 17: Search

This post was published as part of the symfony 2008 advent calendar. As this tutorial might have been updated since then, you are advised to read the last version from the symfony 1.2 documentation (for Propel or Doctrine).

Previously on Jobeet

Two days ago, we added some feeds to keep Jobeet users up-to-date with new job posts. Today, we will continue to improve the user experience by implementing the latest main feature of the Jobeet website: the search engine.

The Technology

Before we jump in head first, let's talk a bit about the history of symfony. We advocate a lot of best practices, like tests and refactoring, and we also try to apply them to the framework itself. For instance, we like the famous "Don't reinvent the wheel" motto. As a matter of fact, the symfony framework started its life four years ago as the glue between two existing Open-Source softwares: Mojavi and Propel. And every time we need to tackle a new problem, we look for an existing library that does the job well before coding one ourself from scratch.

Today, we want to add a search engine to Jobeet, and the Zend Framework provides a great library, called Zend Lucene, which is a port of the well-know Java Lucene project. Instead of creating yet another search engine for Jobeet, which is quite a complex task, we will use Zend Lucene.

On the Zend Lucene documentation page, the library is described as follows:

... a general purpose text search engine written entirely in PHP 5. Since it stores its index on the filesystem and does not require a database server, it can add search capabilities to almost any PHP-driven website. Zend_Search_Lucene supports the following features:

  • Ranked searching - best results returned first
  • Many powerful query types: phrase queries, boolean queries, wildcard queries, proximity queries, range queries and many others
  • Search by specific field (e.g., title, author, contents)

This chapter is not a tutorial about the Zend Lucene library, but how to integrate it into the Jobeet website; or more generally, how to integrate third-party libraries into a symfony project. If you want more information about this technology, please refer to the Zend Lucene documentation.

Zend Lucene has already been installed yesterday as part of the Zend Framework installation we did for sending emails.

Indexing

The Jobeet search engine should be able to return all jobs matching keywords entered by the user. Before being able to search anything, an index has to be built for the jobs; for Jobeet, it will be stored in the data/ directory.

Zend Lucene provides two methods to retrieve an index depending whether one already exists or not. Let's create a helper method in the JobeetJobPeer class that returns an existing index or creates a new one for us:

// lib/model/JobeetJobPeer.php
static public function getLuceneIndex()
{
  ProjectConfiguration::registerZend();
 
  if (file_exists($index = self::getLuceneIndexFile()))
  {
    return Zend_Search_Lucene::open($index);
  }
  else
  {
    return Zend_Search_Lucene::create($index);
  }
}
 
static public function getLuceneIndexFile()
{
  return sfConfig::get('sf_data_dir').'/job.'.sfConfig::get('sf_environment').'.index';
}
 

Each time a job is created, updated, or deleted, the index must be updated too.

The save() method

Edit JobeetJob to update the index whenever a job is serialized to the database:

// lib/model/JobeetJob.php
public function save(PropelPDO $con = null)
{
  // ...
 
  $ret = parent::save($con);
 
  $this->updateLuceneIndex();
 
  return $ret;
}
 

And create the updateLuceneIndex() method that does the actual work:

// lib/model/JobeetJob.php
public function updateLuceneIndex()
{
  $index = JobeetJobPeer::getLuceneIndex();
 
  // remove an existing entry
  if ($hit = $index->find('pk:'.$this->getId()))
  {
    $index->delete($hit->id);
  }
 
  // don't index expired and non-activated jobs
  if ($this->isExpired() || !$this->getIsActivated())
  {
    return;
  }
 
  $doc = new Zend_Search_Lucene_Document();
 
  // store job primary key URL to identify it in the search results
  $doc->addField(Zend_Search_Lucene_Field::UnIndexed('pk', $this->getId()));
 
  // index job fields
  $doc->addField(Zend_Search_Lucene_Field::UnStored('position', $this->getPosition(), 'utf-8'));
  $doc->addField(Zend_Search_Lucene_Field::UnStored('company', $this->getCompany(), 'utf-8'));
  $doc->addField(Zend_Search_Lucene_Field::UnStored('location', $this->getLocation(), 'utf-8'));
  $doc->addField(Zend_Search_Lucene_Field::UnStored('description', $this->getDescription(), 'utf-8'));
 
  // add job to the index
  $index->addDocument($doc);
  $index->commit();
}
 

As Zend Lucene is not able to update an existing entry, it is removed first if the job already exists in the index.

Indexing the job itself is simple: the primary key is stored for future reference when searching jobs and the main columns (position, company, location, and description) are indexed but not stored in the index as we will use the real objects to display the results.

Propel Transactions

What if there is a problem when indexing a job or if the job is not saved into the database? Both Propel and Zend Lucene will throw an exception. But under some circumstances, we might have a job saved in the database without the corresponding indexing. To prevent this from happening, we can wrap the two updates in a transaction and rollback in case of an error:

// lib/model/JobeetJob.php
public function save(PropelPDO $con = null)
{
  // ...
 
  $con = $con ? $con : Propel::getConnection(JobeetJobPeer::DATABASE_NAME);
  $con->beginTransaction();
  try
  {
    $ret = parent::save($con);
 
    $this->updateLuceneIndex();
 
    $con->commit();
 
    return $ret;
  }
  catch (Exception $e)
  {
    $con->rollBack();
    throw $e;
  }
}
 

delete()

We also need to override the delete() method to remove the entry of the deleted job from the index:

// lib/model/JobeetJob.php
public function delete(PropelPDO $con = null)
{
  $index = JobeetJobPeer::getLuceneIndex();
 
  if ($hit = $index->find('pk:'.$this->getId()))
  {
    $index->delete($hit->id);
  }
 
  return parent::delete($con);
}
 

Mass delete

Whenever you load the fixtures with the propel:data-load task, symfony removes all the existing job records by calling the JobeetJobPeer::doDeleteAll() method. Let's override the default behavior to also delete the index altogether:

// lib/model/JobeetJobPeer.php
public static function doDeleteAll($con = null)
{
  if (file_exists($index = self::getLuceneIndexFile()))
  {
    sfToolkit::clearDirectory($index);
    rmdir($index);
  }
 
  return parent::doDeleteAll($con);
}
 

Searching

Now that we have everything in place, you can reload the fixture data to index them:

$ php symfony propel:data-load --env=dev

The task is run with the --env option as the index is environment dependent and the default environment for task is cli.

For Unix-like users: as the index is modified from the command line and also from the web, you must change the index directory permissions accordingly depending on your configuration: check that both the command line user you use and the web server user can write to the index directory.

Implementing the search in the frontend is a piece of cake. First, create a route:

job_search:
  url:   /search
  param: { module: job, action: search }
 

And the corresponding action:

// apps/frontend/modules/job/actions/actions.class.php
class jobActions extends sfActions
{
  public function executeSearch(sfWebRequest $request)
  {
    if (!$query = $request->getParameter('query'))
    {
      return $this->forward('job', 'index');
    }
 
    $this->jobs = JobeetJobPeer::getForLuceneQuery($query);
  }
 
  // ...
}
 

The template is also quite straightforward:

// apps/frontend/modules/job/templates/searchSuccess.php
<?php use_stylesheet('jobs.css') ?>
 
<div id="jobs">
  <?php include_partial('job/list', array('jobs' => $jobs)) ?>
</div>
 

The search itself is delegated to the getForLuceneQuery() method:

// lib/model/JobeetJobPeer.php
static public function getForLuceneQuery($query)
{
  $hits = self::getLuceneIndex()->find($query);
 
  $pks = array();
  foreach ($hits as $hit)
  {
    $pks[] = $hit->pk;
  }
 
  $criteria = new Criteria();
  $criteria->add(self::ID, $pks, Criteria::IN);
  $criteria->setLimit(20);
 
  return self::doSelect(self::addActiveJobsCriteria($criteria));
}
 

After we get all results from the Lucene index, we filter out the inactive jobs, and limit the number of results to 20.

To make it work, update the layout:

// apps/frontend/templates/layout.php
<h2>Ask for a job</h2>
<form action="<?php echo url_for('@job_search') ?>" method="get">
  <input type="text" name="query" value="<?php echo isset($query) ? $query : '' ?>" id="search_keywords" />
  <input type="submit" value="search" />
  <div class="help">
    Enter some keywords (city, country, position, ...)
  </div>
</form>
 

Zend Lucene defines a rich query language that supports operations like Booleans, wildcards, fuzzy search, and much more. Everything is documented in the Zend Lucene manual

Unit Tests

What kind of unit tests do we need to create to test the search engine? We obviously won't test the Zend Lucene library itself, but its integration with the JobeetJob class.

Add the following tests at the end of the JobeetJobTest.php file and don't forget to update the number of tests at the beginning of the file:

// test/unit/model/JobeetJobTest.php
$t->comment('->getForLuceneQuery()');
$job = create_job(array('position' => 'foobar', 'is_activated' => false));
$job->save();
$jobs = JobeetJobPeer::getForLuceneQuery('position:foobar');
$t->is(count($jobs), 0, '::getForLuceneQuery() does not return non activated jobs');
 
$job = create_job(array('position' => 'foobar', 'is_activated' => true));
$job->save();
$jobs = JobeetJobPeer::getForLuceneQuery('position:foobar');
$t->is(count($jobs), 1, '::getForLuceneQuery() returns jobs matching the criteria');
$t->is($jobs[0]->getId(), $job->getId(), '::getForLuceneQuery() returns jobs matching the criteria');
 
$job->delete();
$jobs = JobeetJobPeer::getForLuceneQuery('position:foobar');
$t->is(count($jobs), 0, '::getForLuceneQuery() does not return delete jobs');
 

We test that a non activated job, or a deleted one does not show up in the search results; we also check that jobs matching the given criteria do show up in the results.

Tasks

Eventually, we need to create a task to cleanup the index from stale entries (when a job expires for example) and optimize the index from time to time. As we already have a cleanup task, let's update it to add those features:

// lib/task/JobeetCleanupTask.class.php
protected function execute($arguments = array(), $options = array())
{
  $databaseManager = new sfDatabaseManager($this->configuration);
 
  // cleanup Lucene index
  $index = JobeetJobPeer::getLuceneIndex();
 
  $criteria = new Criteria();
  $criteria->add(JobeetJobPeer::EXPIRES_AT, time(), Criteria::LESS_THAN);
  $jobs = JobeetJobPeer::doSelect($criteria);
  foreach ($jobs as $job)
  {
    if ($hit = $index->find('pk:'.$job->getId()))
    {
      $hit->delete();
    }
  }
 
  $index->optimize();
 
  $this->logSection('lucene', 'Cleaned up and optimized the job index');
 
  // Remove stale jobs
  $nb = JobeetJobPeer::cleanup($options['days']);
 
  $this->logSection('propel', sprintf('Removed %d stale jobs', $nb));
}
 

The task removed all expired jobs from the index and then optimizes it thanks to the Zend Lucene built-in optimize() method.

See you Tomorrow

Today, we implemented a full search engine with many features in less than an hour. Every time you want to add a new feature to your projects, check that it has not yet been solved somewhere else. First, check if something is not implemented natively in the symfony framework. Then, check the symfony plugins. And don't forget to check the Zend Framework libraries and the ezComponent ones too.

Tomorrow, we will use some unobtrusive JavaScripts to enhance the responsiveness of the search engine by updating the results in real-time as the user types in the search box. Of course, this will be the occasion to talk about how to use AJAX with symfony.

Comments

In de doDeleteAll method there's a small error.

replace:

if (file_exists($index = self::getLuceneIndex()))


With:
if (file_exists($index = self::getLuceneIndexFile()))
isset($query)? in layout always returns false.
To show query text in a search box the $sf_request->getParameter('query') can be used, or partial (as described here: http://trac.symfony-project.org/wiki/Symfony11LayoutUpgrade)
i get this :(

Fatal error: Call to a member function beginTransaction() on a non-object in C:\
dev\sf\jobeet\lib\model\JobeetJob.php on line 38
Hi,
why not using the Doctrine Searchable behavior in the Doctrine version of this tutorial ?
And also why not calling updateLuceneIndex() inside the Doctrine_Record::postSave() method ?

PS: could be useful to enable comments in the Doctrine section of Jobeet ...
@Gianko
Had same pb => adding "if (is_null($con) ) { $con = Propel::getConnection(JobeetJobPeer::DATABASE_NAME);
}" before the try...catch in JobeetJob::save() solved it.
bugs fixed, thanks!
Not a bug, but in the Doctrine version there is a reference to Propel at the end:
$this->logSection('propel', sprintf('Removed %d stale jobs', $nb));
Hi there!
I have always had a big problem with search engines and spanish words... a very helpful example is this:
My last name is "García", but many times I just write "Garcia". The problem is that if I put "García" in the DB, then when i search for "Garcia" it doesn't get any matches.
Does any body else have had this problem? or any suggestion to solve it?

Thanks!

Comments are closed.

To ensure that comments stay relevant, they are closed for old posts.