Installation Guides

There are two main ways to install Yioop: using its own internal web server, or under an existing web server such as Apache. Yioop will probably run faster using its own internal web server; however, running under a traditional web server is likely slightly more stable. Below, instructions for installing in either setting are given for a variety of operating systems.

Demo Install Video

A half-hour demo of installing Yioop is available at yioop.com: Yioop Install Demo. The Yioop Tutorials Wiki on yioop.com has video tutorials for several of Yioop's features; this wiki also illustrates the Yioop software's ability to do video streaming.

Install Yioop Without a Web Server

The main idea in all of the instructions below is first to obtain a version of PHP configured so that it can run Yioop, and then to run Yioop. If you already have PHP installed on your machine, you may be able to skip directly to the steps for running Yioop.
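
Whatever your operating system, you can check which PHP you already have, and which extensions it was built with, from a terminal (a quick check, assuming php is on your PATH):
     php -v
     php -m
Here php -v reports the installed PHP version, if any, and php -m lists the enabled extensions -- look for names such as curl, mbstring, sqlite3, and gd from the lists below.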

Windows

  1. From the Windows 10 Search Task Bar, enter PowerShell, right-click on it, and run it as administrator.
  2. Install Chocolatey Package Manager:
     Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object
     System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
    
  3. Restart PowerShell as administrator.
  4. Next, install PHP, SQLite, and the Atom editor using Chocolatey, by typing at the PowerShell prompt:
     choco install php
     choco install sqlite
     choco install atom
    
  5. You should now have a shortcut to the Atom editor on your Desktop. Click on it, and within the editor open PHP's configuration file (for Chocolatey's PHP 7.2 this is typically):
     C:\tools\php72\php.ini
    
  6. Using find in the editor, locate the lines containing the following and remove the leading semicolons (alternatively, see the PowerShell sketch after this list):
     extension=bz2
     extension=curl
     extension=exif
     extension=fileinfo
     extension=gd2
     extension=mbstring
     extension=openssl
     extension=pdo_sqlite
     extension=sqlite3
    
  7. Save the php.ini file.
  8. Download Yioop and unzip it into
     C:\yioop
    
  9. From PowerShell, type:
     cd C:\yioop
     php index.php
    
  10. Yioop should now be running on port 8080. If you want Yioop to run on a different port, in the above you could have typed:
     php index.php some_other_port
    
  11. In a browser, go to the page http://localhost:8080/ . You should see the default search landing page for Yioop. Click Sign In and fill in the form as:
     Login: root
     Password: (leave blank)
    
  12. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
    The crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  13. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  14. Let it crawl for a while, until you see the Total URLs Seen > 1.
  15. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
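
If you prefer not to edit php.ini by hand in step 6, the following PowerShell sketch uncomments all nine extensions in one pass (adjust $ini to the php.ini path you opened in step 5 if yours differs):
     $ini = 'C:\tools\php72\php.ini'
     (Get-Content $ini) -replace '^;(extension=(bz2|curl|exif|fileinfo|gd2|mbstring|openssl|pdo_sqlite|sqlite3))$', '$1' |
         Set-Content $ini
Run this from the same administrator PowerShell you used for the Chocolatey installs, since php.ini usually is not writable by ordinary users.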

macOS

  1. Install the Homebrew package manager. Open a terminal window and type:
     /usr/bin/ruby -e "$(curl -fsSL \
     https://raw.githubusercontent.com/Homebrew/install/master/install)"
    
  2. Install PHP. Type from the command line:
     brew install php
    
  3. Edit the php.ini file to enable the extensions Yioop needs. First, type at the command prompt:
     nano /usr/local/etc/php/7.2/php.ini
    
    Locate in the php.ini file the lines containing:
     extension=bz2
     extension=curl
     extension=exif
     extension=fileinfo
     extension=gd2
     extension=mbstring
     extension=openssl
     extension=pdo_sqlite
     extension=sqlite3
    
    and remove the semicolon at the start of each of these lines (or see the sed sketch after this list). Save the php.ini file.
  4. Download Yioop and unzip it onto your Desktop.
  5. From the terminal type:
     cd ~/Desktop/yioop
     php index.php
    
  6. Yioop should now be running on port 8080. If you want Yioop to run on a different port, in the above you could have typed:
     php index.php some_other_port
    
  7. In a browser, go to the page http://localhost:8080/ . You should see the default search landing page for Yioop. Click Sign In and fill in the form as:
     Login: root
     Password: (leave blank)
    
  8. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
    The crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  9. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  10. Let it crawl for a while, until you see the Total URLs Seen > 1.
  11. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
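
If you would rather not uncomment the extensions in step 3 by hand, a single BSD sed command can make all nine edits at once (a sketch, assuming the php.ini path shown above):
     sed -i '' -E 's/^;(extension=(bz2|curl|exif|fileinfo|gd2|mbstring|openssl|pdo_sqlite|sqlite3))$/\1/' \
         /usr/local/etc/php/7.2/php.ini
The -i '' flag tells the macOS (BSD) version of sed to edit the file in place.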

Ubuntu/Debian Linux

The instructions described here have been tested on Ubuntu 18.04 LTS.
  1. Get PHP set-up by running the following commands as needed (you might have already done some). For Ubuntu 18.04 LTS type:
     sudo apt install curl
     sudo apt install php7.2-cli
     sudo apt install php7.2-mbstring
     sudo apt install php7.2-sqlite3
     sudo apt install php7.2-curl
     sudo apt install php7.2-gd
     sudo apt install php7.2-xml
     sudo apt install php7.2-bcmath
    
  2. Download Yioop, unzip it into /var/www, and use mv to rename the Yioop folder to yioop.
  3. Start Yioop using its own web server:
     cd /var/www/yioop
     php index.php
    
    This will run the web server on port 8080. To run on some other port:
     sudo php index.php some_other_port_number
    
  4. In a browser, go to the page http://localhost:8080/ (see also the curl check after this list). You should see the default search landing page for Yioop. Click sign in and fill in the form as:
     Login: root
     Password: (leave blank)
    
  5. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
    The crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  6. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  7. Let it crawl for a while, until you see the Total URLs Seen > 1.
  8. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
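
Once php index.php is running, you can confirm the built-in web server is answering without leaving the terminal (a quick check using the curl installed in step 1):
     curl -s http://localhost:8080/ | head
This should print the first few lines of the HTML for Yioop's landing page.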

HHVM rather than PHP

HHVM is Facebook's open-source virtual machine for executing PHP. It can offer a significant performance speed-up over the traditional PHP interpreter prior to PHP 7. It also supports the Hack language, a variant of PHP. HHVM works for both Linux and macOS, but at this time (mid-2018) did not seem to be available for Windows. HHVM seemed to require slightly more memory to run without crashing, so make sure your machine has at least 8GB of memory.
  1. If on a Mac, install the Homebrew package manager.
  2. Install hhvm if it is not already installed (a version check appears after this list). On Linux:
     sudo apt install hhvm
    
    on a Mac (note that Homebrew should not be run with sudo):
     brew install hhvm
    
  3. Download Yioop and use mv to rename the Yioop folder to yioop.
  4. cd into the yioop folder.
  5. Run Yioop:
     hhvm index.php
    
    This will run the web server on port 8080. To run on some other port:
     sudo hhvm index.php some_other_port_number
    
  6. In a browser, go to the page http://localhost:8080/ . You should see the default search landing page for Yioop. Click sign in and fill in the form as:
     Login: root
     Password: (leave blank)
    
  7. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
    The crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  8. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  9. Let it crawl for a while, until you see the Total URLs Seen > 1.
  10. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
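
Before starting Yioop, you can confirm which hhvm you ended up with (a quick sanity check):
     hhvm --version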

Install Yioop Under a Web Server

XAMPP on Windows

  1. Download Xampp. These directions were tested on Xampp 5.6.11.
  2. Install Xampp.
  3. Open Control Panel. Go to System => Advanced system settings => Advanced. Click on Environment Variables. Look under System Variables and select Path. Click Edit. Tack onto the end of Variable Values:
     ;C:\xampp\php;
    
    Click OK a bunch of times to dismiss the windows. Close the Control Panel window, then reopen it and go to the same place to make sure the Path variable really was changed (see also the check after this list). I then restarted the machine to make sure these settings took effect.
  4. Use the Xampp control panel to start at least Apache.
  5. Download Yioop and unzip it into
     C:\xampp\htdocs
    
    Rename the downloaded folder yioop (so you now have a folder C:\xampp\htdocs\yioop). Point your browser at:
     http://localhost/yioop/
    
  6. You should see the Yioop landing page. Login with username root and empty password.
  7. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
  8. Crawl Robot Name is what will appear together with a url to a bot.php page in the web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  9. Now go to Manage Crawls. Click on Options. Set the options you would like for your crawl. Click Save.
  10. Type the name of the crawl and start crawl. Let it crawl for a while, until you see the Total URLs Seen > 1.
  11. Click stop crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. Then you can search using this index.
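
To verify the PATH change from step 3 took effect, open a fresh PowerShell window and check which php Windows now finds (a quick check):
     where.exe php
     php -v
The first command should list C:\xampp\php\php.exe among its results; the second should print the PHP version Xampp shipped with.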

Wamp

  1. Download WampServer. These instructions were tested on the 64 bit version of WampServer 2.5 that came with PHP 5.5.
  2. Download Yioop and unzip it into
     C:\wamp\www
    
  3. Rename the downloaded folder yioop (so you should now have a folder C:\wamp\www\yioop).
  4. Make sure PHP curl is enabled. To do this, use the Wamp dock tool, navigate to wamp => php => extension, and turn on curl. This enables curl in one of the php.ini files that WAMP uses.
  5. Unfortunately, Wamp has two php.ini files. The one we just edited by doing this is in
     C:\wamp\bin\apache\Apache2.4.9\bin
    
    You need to also edit the php.ini in
     C:\wamp\bin\php\php5.5.12
    
    Depending on your version of Wamp the PHP version number may be different. Open this file in an editor and make sure the line:
     extension=php_curl.dll
    
    doesn't have a semicolon in front of it.
  6. Next go to control panel => system => advanced system settings => advanced => environment variables => system variables =>path. Click edit and add to the path variable:
     ;C:\wamp\bin\php\php5.5.12;
    
    Exit the control panel, then re-enter it to double-check that the path really was added to the end. Restart your PC. Start Apache in WampServer.
  7. Go to http://localhost/yioop/ in a browser. You should see the default landing page for Yioop. Click Sign In and use the login: root and no password.
  8. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
    The crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  9. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  10. Let it crawl for a while, until you see the Total URLs Seen > 1. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.

XAMPP on Mac OSX

  1. Download Xampp. These directions were tested on Xampp 5.6.11.
  2. Install Xampp.
  3. After the install, if the Xampp manager-osx.app is running, quit it.
  4. In a text editor, open the file:
     /Applications/XAMPP/xamppfiles/etc/httpd.conf
    
    Locate the lines:
     <IfModule unixd_module>
     #
     # If you wish httpd to run as a different user or group, you must run
     # httpd as root initially and it will switch.  
     #
     # User/Group: The name (or #number) of the user/group to run httpd as.
     # It is usually good practice to create a dedicated user and group for
     # running httpd, as with most system services.
     #
     User daemon
     Group daemon
     </IfModule>
    
    Change User daemon to User your_mac_username and Group daemon to Group staff. After the change, for me, those two lines became:
     User cpollett
     Group staff
    
    These changes are not strictly necessary, but can eliminate headaches if you ever start running any of the Yioop applications at the terminal prompt under your user account. I am assuming if you are using Xampp, it is not for a production server.
  5. Download Yioop and unzip it into
     /Applications/XAMPP/xamppfiles/htdocs
    
  6. Rename the downloaded folder yioop (so you now have a folder /Applications/XAMPP/xamppfiles/htdocs/yioop).
  7. In a text editor, open the file:
     /Applications/XAMPP/xamppfiles/htdocs/yioop/src/library/CrawlDaemon.php
    
    edit the line:
     $php = "php";
    
    change it to (a quick check of this path appears after this list):
     $php = "/Applications/XAMPP/xamppfiles/bin/php";
    
  8. Open the Terminal.app under Applications => Utilities and type the lines:
     sudo chown -R your_mac_username /Applications/XAMPP/xamppfiles
     sudo chown -R root /Applications/XAMPP/xamppfiles/manager-osx.app
    
    Here your_mac_username should be the same username you typed above.
  9. Start Apache by double clicking on Xampp's manager-osx.app, choosing the Manage Servers tab, selecting Apache Web Server, and clicking start.
  10. Point your browser at:
     http://localhost/yioop/
    
  11. You should see the Yioop landing page. Login with username root and empty password.
  12. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
  13. Crawl Robot Name is what will appear together with a url to a bot.php page in the web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  14. Now go to Manage Crawls. Click on Options. Set the options you would like for your crawl. Click Save.
  15. Type the name of the crawl and start crawl. Let it crawl for a while, until you see the Total URLs Seen > 1.
  16. Click stop crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. Then you can search using this index.
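
You can double-check that the path you entered into CrawlDaemon.php in step 7 points at a working PHP binary by running it directly in Terminal:
     /Applications/XAMPP/xamppfiles/bin/php -v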

macOS / Mac OSX Server

The instructions given here are for OSX Mountain Lion (10.8) or a more recent version of OSX/macOS. I will use the terms OSX and macOS interchangeably. Apple changes the locations of files slightly between versions, so you might have to do a little exploring to find things on earlier OSX versions.
  1. Turn on Apache with PHP enabled.
  2. Not OSX Server: Traditionally, on (pre-Mountain Lion) OSX, one could go to System Preferences => Sharing and turn on Web Sharing to get the web server running. This option was removed in Mountain Lion; however, from the command line (Terminal), one can type:
     sudo apachectl start
    
    to start the Web server, and similarly,
     sudo apachectl stop
    
    to stop it. Alternatively, to make the web server start each time the machine is turned on, one can type:
     sudo defaults write /System/Library/LaunchDaemons/org.apache.httpd Disabled -bool false
    
  3. By default, document root is /Library/WebServer/Documents. The configuration files for Apache in this setting are located in /etc/apache2. If you want to tweak document root or other Apache settings, look in the folder /etc/apache2/other and edit appropriate files such as httpd-vhosts.conf or httpd-ssl.conf . Before turning on Web Sharing / the web server, you need to edit the file /etc/apache2/httpd.conf. Let X=5 or X=7 (depending on how old a machine you are using). Replace
     #LoadModule phpX_module libexec/apache2/libphpX.so
    
    with
     LoadModule phpX_module libexec/apache2/libphpX.so
    
    You should also make sure that the rewrite_module is being loaded. OSX Server: Pre-Mountain Lion, OSX Server used /etc/apache2 to store its configuration files. Since Mountain Lion, these files are in /Library/Server/Web/Config/apache2. Within this folder, the sites folder holds Apache directives for specific virtual hosts. Make sure the <Directory> tag for the location where you intend to install Yioop has AllowOverride set to All.
  4. OSX Server comes with Server.app, which will actively fight any direct tweaking of configuration files. To get the web server running from Server.app, click on Websites. Make sure "Enable PHP web applications" is checked and Websites is On. The default web site is
     /Library/Server/Web/Data/Sites/Default , 
    
    you probably want to click on + under Websites and specify the document root to be as you like.
  5. For the remainder of this guide, we assume the document root for the web server is /Library/WebServer/Documents. Download Yioop, unpack it into /Library/WebServer/Documents, and rename the Yioop folder to yioop.
  6. Chown this folder to the Webserver user:
     chown -R _www yioop
    
  7. You probably want to make sure Spotlight (Mac's built-in file and folder indexer) doesn't index this folder -- especially during a crawl -- or your system might really slow down. To prevent this, open System Preferences, choose Spotlight, select the Privacy tab, and add the above folder to the list of folders Spotlight shouldn't index. If you are storing crawls on an external drive, you might want to make sure that drive gets automounted without a login. This is useful in the event of a power failure that exceeds your backup power supply time. To do this you can write the preference (a read-back check appears after this list):
     sudo defaults write /Library/Preferences/SystemConfiguration/autodiskmount \
         AutomountDisksWithoutUserLogin -bool true
    
  8. This will mean the hard drive becomes available when the power comes back. To make your Mac restart when the power is back, under System Preferences => Energy Saver there is a check box next to "Start up automatically after a power failure". Check it.
  9. In a browser, go to the page http://localhost/yioop/ . You should see the default Yioop landing page. Sign-in using the login: root and no password. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
  10. Crawl Robot Name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  11. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  12. Let it crawl for a while, until you see the Total URLs Seen > 1.
  13. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
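
To confirm that the automount preference from step 7 was actually written, you can read it back (a quick check using defaults read; it should print 1):
     sudo defaults read /Library/Preferences/SystemConfiguration/autodiskmount \
         AutomountDisksWithoutUserLogin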

Ubuntu Linux / Debian (with Suhosin Hardening Patch)

The instructions described here have been tested on Ubuntu 12.04 LTS, Ubuntu 14.04 LTS, Ubuntu 16.04 LTS, and Ubuntu 18.04 LTS.
  1. Get PHP (and optionally Apache) set-up by running the following commands as needed (you might have already done some). For Ubuntu 18.04 LTS type:
     sudo apt install curl
     sudo apt install php7.2-cli
     sudo apt install php7.2-mbstring
     sudo apt install php7.2-sqlite3
     sudo apt install php7.2-curl
     sudo apt install php7.2-gd
     sudo apt install php7.2-xml
     sudo apt install php7.2-bcmath
     sudo apt install apache2
     sudo a2enmod php7.2
     sudo a2enmod rewrite
    
    For Ubuntu 16.04 LTS type:
     sudo apt install curl
     sudo apt install apache2 #if not using apache don't need
     sudo apt install php7.0
     sudo apt install libapache2-mod-php7.0
     sudo apt install php7.0-cli
     sudo apt install php7.0-sqlite3
     sudo apt install php7.0-curl
     sudo apt install php7.0-gd
     sudo apt install php7.0-mbstring
     sudo apt install php7.0-xml
     sudo apt install php7.0-bcmath
     sudo a2enmod php7.0 #if not using apache don't need
     sudo a2enmod rewrite #if not using apache don't need
    
    For Ubuntu 12.04 LTS or Ubuntu 14.04 LTS type:
     sudo apt-get install curl
     sudo apt-get install apache2 #if not using apache don't need
     sudo apt-get install php5
     sudo apt-get install php5-cli
     sudo apt-get install php5-sqlite
     sudo apt-get install php5-curl
     sudo apt-get install php5-gd
     sudo a2enmod rewrite #if not using apache don't need
    
  2. If you are not using Apache, skip ahead to step 7.
  3. After this sequence, depending on which version of Ubuntu you are using, the files /etc/apache2/mods-enabled/php7.2.conf, /etc/apache2/mods-enabled/php7.0.conf, or /etc/apache2/mods-enabled/php5.conf and /etc/apache2/mods-enabled/php7.2.load, /etc/apache2/mods-enabled/php7.0.load, or /etc/apache2/mods-enabled/php5.load should exist and link to the corresponding files in /etc/apache2/mods-available. The sudo a2enmod rewrite line above enables URL rewriting in Apache and should create the file /etc/apache2/mods-enabled/rewrite.load. The configuration files for PHP are /etc/php/7.2/apache2/php.ini, /etc/php/7.0/apache2/php.ini, or /etc/php5/apache2/php.ini (for the Apache module) and /etc/php/7.2/cli/php.ini, /etc/php/7.0/cli/php.ini, or /etc/php5/cli/php.ini (for the command-line interpreter). You want to make changes to both configurations. To get a feel for this, in a text editor (ed, vi, nano, gedit, etc.) modify the line:
     post_max_size = 8M
    
    to
     post_max_size = 32M
    
    This change is not strictly necessary, but will improve performance.
  4. Debian's (not Ubuntu's) PHP version has the Suhosin hardening patch enabled by default. On Yioop before Version 0.941, this caused problems because Yioop made mt_srand calls which were ignored. To fix this, you should add to the end of both php.ini files listed above (alternatively, you could add to /etc/php5/apache2/conf.d/suhosin.ini and /etc/php5/cli/conf.d/suhosin.ini):
     suhosin.srand.ignore = Off
     suhosin.mt_srand.ignore = Off
    
    This modification is not needed for Version 0.941 and higher. Suhosin hardening also imposes a second limit on HTTP POST requests: you should set suhosin.post.max_value_length to the same value you set for post_max_size.
  5. Looking in the folders /etc/php5/apache2/conf.d and /etc/php5/cli/conf.d, you can see which extensions are being loaded by PHP. The presence of files such as curl.ini, gd.ini, and sqlite.ini indicates those extensions will be loaded.
  6. The DocumentRoot for web sites (virtual hosts) served by an Ubuntu Linux machine is typically specified by files in /etc/apache2/sites-enabled. In this example, it was given in the file 000-default and specified to be /var/www/. We are going to install Yioop into /var/www/yioop. The Yioop folder has an .htaccess file with additional configuration directives for Apache. For these to work, you either need to add, before the </VirtualHost> tag in 000-default, lines like:
     <Directory /var/www/yioop >
         Options Indexes FollowSymLinks
         AllowOverride all
     </Directory>
    
    or you need to take the lines from the .htaccess file and add them to a directory tag like the above.
  7. Download Yioop, unpack it into /var/www, and use mv to rename the Yioop folder to yioop.
  8. Restart the web server (a module check appears after this list).
     sudo apachectl stop
     sudo apachectl start
    
  9. In a browser, go to http://localhost/yioop/ under Apache. You should see the default search landing page for Yioop. Click sign in and use the login: root and no password.
  10. Now go to Yioop => Configure and alter the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
    
    The crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  11. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  12. Let it crawl for a while, until you see the Total URLs Seen > 1.
  13. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
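
After the restart in step 8, you can confirm that the PHP module and rewrite_module from step 1 actually loaded (a quick check; the exact PHP module name depends on your PHP version):
     sudo apachectl -M | grep -Ei 'php|rewrite'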

CentOS Linux

These instructions were tested running a CentOS 7 image in VirtualBox. They show how to get Yioop running under Apache. To get Yioop to run as a server by itself, after installing PHP below, look at the Ubuntu instructions for running Yioop as its own server.
To get started, log in, launch a terminal window, and su root.
  1. CentOS makes use of Secure Linux (SELinux), which greatly restricts what Apache is able to do. To keep things simple, turn off SELinux by editing the file /etc/sysconfig/selinux and setting SELINUX=disabled. Restart the machine.
  2. The image we were using didn't have Apache installed, and at the site suggested for downloading CentOS VMs, some but not all of the images had the nano editor installed. Both can be installed with the commands:
     yum install httpd 
     yum install nano 
    
  3. If you didn't su root, then you will need to put sudo before all commands in this guide, and you will have to make sure the user you are running under is in the list of sudoers.
  4. Apache's configuration files are in the /etc/httpd directory. To get rid of the default web landing page, we switch into the conf.d subfolder and disable welcome.conf. To do this, first type the commands:
     cd /etc/httpd/conf.d
     nano welcome.conf
    
    Then, using the editor, put #'s at the start of each line and save the result. You also want to edit /etc/httpd/conf/httpd.conf to set AllowOverride All within the <Directory "/var/www/html"> block.
  5. Next, we install git, PHP, and the various PHP extensions we need:
     yum install git
     yum install php
     yum install php-mbstring
     yum install php-sqlite3
     yum install gd
     yum install php-gd
    
  6. The default Apache DocumentRoot under Centos is /var/www/html. We will install Yioop in a folder /var/www/html/yioop. This can be accessed by pointing a browser at http://127.0.0.1/yioop/ . To download Yioop to /var/www/html/yioop and to create a work directory, we run the commands:
     cd /var/www/html
     git clone http://seekquarry.com/git/yioop.git yioop
     chown -R apache yioop
    
  7. Restart/start the web server (a quick check appears after this list):
     service httpd stop
     service httpd start
    
  8. Go to http://localhost/yioop/. You should see the default Yioop landing page. Then enter root for the username and blank for the password to login.
  9. Now go to Yioop => Configure and input the following settings:
     Search Engine Work Directory: (don't change)
     Default Language: (choose the language you want, or for now leave as English)
     Debug Display: (don't change)
     Search Access: (don't change)
     Crawl Robot Name: TestBot
     Robot Description: This bot is for test purposes. It respects robots.txt
     If you are having problems with it, please feel free to ban it.
    
    Crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
  10. Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
  11. Let it crawl for a while, until you see the Total URLs Seen > 1.
  12. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
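
Before crawling, it is worth confirming that Apache loaded mod_php and that the extensions installed in step 5 are present (a quick check, run as root like the rest of this guide):
     httpd -M | grep -i php
     php -m | grep -Ei 'curl|gd|mbstring|sqlite'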

cPanel

Generally, it is not practical to do your crawling on a cPanel-hosted website. However, cPanel works perfectly fine for hosting the results of a crawl you did elsewhere. Here we briefly describe how to do this. In capacity planning your installation, as a rule of thumb, you should expect your index to be of comparable size (number of bytes) to the sum of the sizes of the pages you downloaded; for example, if the pages you crawl total about 5GB, budget roughly another 5GB for the index.
  1. Download Yioop to your local machine.
  2. In cPanel go to File Manager and navigate to the place you want on your server to serve Yioop from. Click upload and choose your zip file so as to upload it to that location.
  3. Select the uploaded file and click extract to extract the zip file to a folder. Reload the page. Rename the extracted folder, if necessary.
  4. For the rest of these instructions, let's assume the site being tested is mysite.my. If at this point one browses to:
     http://mysite.my/yioop/
    
    you should see the landing page of your Yioop instance. You can sign in to this instance using the username root and a blank password.
  5. Go to Manage Account and give yourself a better login and password.
  6. Go to Configure. Look at Component Check and make sure it says Checks Passed. Otherwise, you might have to ask your site provider to upgrade things.
  7. cPanel machines tend to be underpowered so you might want to crawl elsewhere using one of the other install guides then upload the crawl results to your cPanel site.
  8. After performing a crawl, go to Manage Crawls on the machine where you performed the crawl. Look under Previous Crawls and locate the crawl you want to upload. Note its timestamp.
  9. Go to THIS_MACHINES_WORK_DIRECTORY/cache. Locate the folder IndexDatatimestamp, where timestamp is the timestamp of the crawl you want, and ZIP this folder (see the sketch at the end of this section).
  10. In FileManager, under cPanel on the machine you want to host your crawl, navigate to
        yioop_data/cache.
    
  11. Upload the ZIP and extract it.
  12. Go to Manage Crawls on this instance of Yioop, locate this crawl under Previous Crawls, and set it as the default crawl. You should now be able to search and get results from the crawl.
You will probably want to uncheck Cache in the Page Options => Search Time activity, as in this hosted setting it is somewhat hard to get Yioop's cache page feature (which lets users see complete caches of web pages by clicking a link) to work.
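
For steps 8-11, the transfer from the crawl machine might look as follows (a sketch; 1499948689 stands in for your crawl's actual timestamp, and THIS_MACHINES_WORK_DIRECTORY for your actual work directory):
     cd THIS_MACHINES_WORK_DIRECTORY/cache
     zip -r IndexData1499948689.zip IndexData1499948689
You would then upload IndexData1499948689.zip with cPanel's File Manager and extract it into yioop_data/cache on the hosted site.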

Systems with Multiple Queue Servers

This section assumes you have already successfully installed and performed crawls with Yioop in the single queue_server setting, and have succeeded in using Manage Machines to start and stop a queue_server and fetcher. If not, you should consult one of the installation guides above or the general Yioop Documentation.
Before we begin, what are the advantages of using more than one queue_server?
  1. If the queue_servers are running on different processors then they can each be indexing part of the crawl data independently and so this can speed up indexing.
  2. After the crawl is done, the index will typically exist on multiple machines and each needs to search a smaller amount of data before sending it to the name server for final merging. So queries can be faster.
For the purposes of this note, we will consider the case of two queue servers; the same idea works for more. To keep things especially simple, we have both of these queue servers on the same laptop. Advantages (1) and (2) will likely not apply in this case, but we are describing this set-up for testing purposes -- you can take the same idea and put the queue servers on different machines after going through this tutorial.
  1. Download and install Yioop as you would in the single queue_server case. But do this twice. For example, on your machine, if you are running under a web server such as Apache, under its document root you might have two subfolders
     somewhere/yioop1
    
    and
     somewhere/yioop2
    
    each with a complete copy of Yioop. If you are running Yioop using the built-in web server rather than Apache, make sure to start each instance with a different port number:
     php somewhere/yioop1/index.php 8080
     php somewhere/yioop2/index.php 8081
    
    We will use the copy somewhere/yioop1 as an instance of Yioop with both a name server and a queue server; the somewhere/yioop2 will be an instance with just a queue server.
  2. You should leave the work directories of these two instances at their default values. So work directories of these two instances should be different! For each crawl in the multiple queue server setting, each instance will have a copy of those documents it is responsible for. So if we did a crawl with timestamp 10, each instance would have a WORK_DIR/cache/IndexData10 folder and these folders would be disjoint from any other instance.
  3. On the Configure page for each instance, make sure under the Search Access fieldset Web, RSS, and API are checked.
  4. Next click on Server Settings. Make sure the name server and server key are the same for both instances. That is, in the Name Server Set-up fieldset, one might set:
     Server Key:123
     Name Server URL:http://yioop_1_url/
    
    The Crawl Robot Name should also be the same for the two instances, say:
     TestBotFeelFreeToBan
    
    but we want the Robot Instance to be different, say 1 and 2.
  5. Go to the Manage Machines element for somewhere/yioop1, which is the name server. Only the name server needs to manage machines, so we won't do this for somewhere/yioop2 (or for any other queue servers if we had them).
  6. Add machines for each Yioop instance we want to manage with the name server. In this particular case, fill out and submit the Add Machine form twice, the first time with:
     Machine Name:Local1
     Machine Url:http://yioop_1_url/
     Is Mirror: unchecked
     Has Queue Server: checked
     Num Fetchers: 1
    
    the second time with:
     Machine Name:Local2
     Machine Url:http://yioop_2_url/
     Is Mirror: unchecked
     Has Queue Server: checked
     Num Fetchers: 1
    
  7. The Machine Name should be different for each Yioop instance, but can otherwise be whatever you want. Is Mirror controls whether this is a replica of some other node -- I'll save that for a different install guide at some point. If we wanted to run more fetchers we could have chosen a bigger number for Num Fetchers (fetchers are the processes that download web pages).
  8. After the above steps, there should be two machines listed under Machine Information. Click the On button for the queue server and the fetcher of each. They should turn green. If you click the log link, you should start seeing new messages (it refreshes once every 30 seconds) after at most a minute or so.
  9. At this point you are ready to crawl in the multiple queue server setting. You can use Manage Crawl to set-up, start and stop a crawl exactly as in the single queue_server setting.
  10. Perform a crawl and set it as the default index. You can then turn off all the queue servers and fetchers in Manage Machines, if you like.
  11. If you type a query into the search bar of the name server (somewhere/yioop1), you should be getting merged results from both queue servers. To check that this is working, under Configure on the name server (somewhere/yioop1), make sure Query Info is checked and that Use Memcache and Use FileCache are not checked -- the latter two can be turned on later, once we know things are working. When you perform a query now, at the bottom of the page you should see a horizontal rule followed by Query Statistics, followed by all the queries performed in calculating the results. One of these should be PHRASE QUERY. Underneath it you should see Lookup Offset Times, and beneath this, Machine Subtimes: ID_0 and ID_1. If these appear, you know it's working.
When a query is typed into the name server, it tacks no:network onto the query and asks it of all the queue servers. It then merges the results. So if you type "hello" as the search, i.e., if you go to the URL
 http://yioop_1_url/?q=hello
the somewhere/yioop1 script will make in parallel the curl requests
 http://yioop_1_url/?q=hello&network=false&raw=1 
    (raw=1 means no grouping)
 http://yioop_2_url/?q=hello&network=false&raw=1
gets the results back, and merges them. Finally, it returns the result to the user. The network=false tells http://yioop_1_url/ to actually do the query lookup rather than make a network request.
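
You can watch this fan-out yourself by issuing the two sub-queries directly with curl, using the placeholder hosts from above; each should return ungrouped results from just that one queue server:
 curl 'http://yioop_1_url/?q=hello&network=false&raw=1'
 curl 'http://yioop_2_url/?q=hello&network=false&raw=1'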