News

Posted over 8 years ago
In preparation for the next Yioop crawl, I am doing preliminary testing over at findcan.ca. One thing I have just added for the next crawl is support for rel canonical links. This will hopefully improve result quality by making it easier for Yioop to do deduplication. The changes are currently only in the git repository, but will appear in the download version as of Version 1.02.
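As a rough illustration of how rel canonical links can help deduplication, here is a minimal Python sketch. It is not Yioop's code (Yioop is written in PHP); the names CanonicalLinkParser, canonical_url, and group_by_canonical are hypothetical, and the sketch assumes each crawled page is available as a URL plus its HTML text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonical = attr_map["href"]


def canonical_url(page_url, html):
    """Return the declared canonical URL, or page_url if none is declared."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return page_url
    # Resolve relative hrefs against the URL the page was fetched from
    return urljoin(page_url, parser.canonical)


def group_by_canonical(pages):
    """Group crawled URLs (dict of url -> html) by their canonical URL."""
    groups = {}
    for url, html in pages.items():
        groups.setdefault(canonical_url(url, html), []).append(url)
    return groups
```

URLs that land in the same group can then be treated as one document, so only one copy needs to be indexed.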
Posted over 8 years ago
I finally have the hardware in place and have concluded preliminary testing on the current version of Yioop, so I can begin a new, larger-scale crawl, which I am starting today. My goal is something larger than my current record of 1/3 billion pages crawled and indexed. Yioop still runs on six Mac Minis, but each Mini now has 8TB attached to it, twice what it had before. Since the last major crawl, a new summarizer has been added to Yioop, support for Office documents and epub has improved, and several new stemmers have been added. Hopefully, this will be a cool crawl!
Posted over 8 years ago
I'm getting close to releasing Version 1.02. Some new features that I've added in the last week or so are the ability in the wiki to have media list and presentation pages. On a wiki page you can upload media files, which can then be inserted into the wiki page you are editing. These can be images, video, sound files, etc. Wiki pages now have a Settings form in which you can set the title, author, meta description, and robot behavior for when the page is viewed in a logged-in context. This form also controls the wiki page type. Besides the default page type, the media list type gives you a page that just lists all the files you have associated with it. This is convenient for making galleries of images or videos, or playlists of songs. The Presentation type lets you present your page as a slide show, much like a PowerPoint. "...." on a line by itself is the wiki syntax for separating one slide from the next. Finally, one last fun thing I've added is support for crawling sites that use the Gopher protocol. This is an interesting alternative to the web which is actually older and may be more useful for the visually impaired.
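To make the slide separator concrete, here is a small Python sketch of splitting page source on "...." lines. It is only an illustration, not Yioop's implementation; split_slides is a hypothetical name, and ignoring whitespace around the separator is an assumption.

```python
def split_slides(wiki_text):
    """Split wiki source into slides at lines consisting only of '....'."""
    slides = [[]]
    for line in wiki_text.splitlines():
        # Assumption: surrounding whitespace on the separator line is ignored
        if line.strip() == "....":
            slides.append([])  # a separator line starts a new slide
        else:
            slides[-1].append(line)
    return ["\n".join(slide).strip() for slide in slides]
```

A "...." that appears in the middle of a line of text is left alone, since only a line containing nothing else counts as a separator.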
Posted over 8 years ago
Since mid-July, I have been testing the Yioop software on the Findcan.ca site in preparation for a larger scale crawl. I have just written a post on Crawling a Single Country to its blog to explain what I was doing with this testing. Hopefully, other people who want to make a niche crawl will find it interesting.
Posted over 8 years ago
Freecode / Freshmeat used to be a great place to announce versions of free/open source software you were hosting yourself. Each new version of Yioop was announced there. Unfortunately, it shut its doors to new announcements on June 18, 2014, so I couldn't even announce Version 1.0 there. The old entries were left up, but I figured I should capture Yioop's entry before it completely bit the dust (the Internet Archive can sometimes be flaky). I created a wiki page with the old entry, which can be accessed at: Old Yioop Free Code Entry
Posted over 8 years ago
Yioop Version 1.0 is right around the corner! Version 1.0 of Yioop introduces discussion boards and wikis as part of groups. There is now a built-in PUBLIC group whose discussion board is used for the Yioop.com blog. I have moved the old Yioop.com blog to a wiki page: Old Group Blog. After this version of Yioop is released, I will start modifying my hardware for a new crawl. Ciao for now! Chris
Posted over 8 years ago
Over the weekend I was testing turning on PHP's built-in opcache to see how it would affect Yioop's performance. Basically, opcache caches PHP scripts in shared memory after they have been converted to bytecode. This means these scripts don't have to be reloaded and converted to bytecode on each request. The opcache manual page didn't have specific instructions for OS X, so I thought I'd briefly mention them here, since that is what I am running.

On OS X, if you're running Yosemite or later, PHP opcache is already installed; you just need to enable it. There are a couple of php.ini files you could edit: /etc/php.ini or, if you have the Server.app, /Library/Server/Web/Config/php/php.ini. If /etc/php.ini doesn't exist yet, there should be a file /etc/php.ini.default that you can copy to /etc/php.ini. I recommend editing /etc/php.ini since that location is more cross-platform. Within this file you want to tell PHP to load the extension by having the line:

    zend_extension=opcache.so

Other than that, you add configuration settings, either as described on the manual site or by tweaking them. Yioop seems to eat a little more memory than the basic configuration they had, so I bumped up the cache settings for a few things to get:

    realpath_cache_ttl = 120
    zend_extension=opcache.so
    opcache.memory_consumption = 256
    opcache.interned_strings_buffer = 16
    opcache.max_accelerated_files = 4000
    opcache.validate_timestamps = 1
    opcache.revalidate_freq = 0
    opcache.fast_shutdown = 1
    output_buffering = 4096
    implicit_flush = false

Apart from the zend_extension line, only the settings beginning with opcache are directly opcache-related; the others are minor tweaks I threw in for good measure. Be sure to restart the web server after making any changes. As an example of the improvement, a blank query (i.e., a query that should be doing nothing, so the time was just reading and parsing PHP) took about 0.3 seconds before using opcache. After turning on opcache, the time was reduced to 0.03 seconds.
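When editing php.ini by hand it is easy to mistype a directive, so it can help to list what the file actually sets. The Python sketch below is one way to do that under stated assumptions: parse_ini_settings and opcache_settings are hypothetical helpers, SAMPLE is just an excerpt of the settings from this post, and the parser is deliberately simplistic (it treats every ";" as the start of a comment, which would mangle quoted values containing a semicolon).

```python
# Excerpt of the settings from the post, used as sample input
SAMPLE = """\
realpath_cache_ttl = 120
zend_extension=opcache.so
opcache.memory_consumption = 256
opcache.interned_strings_buffer = 16
opcache.max_accelerated_files = 4000
; comments and blank lines are skipped

opcache.revalidate_freq = 0
implicit_flush = false;
"""


def parse_ini_settings(text):
    """Parse 'name = value' lines, skipping comments, blanks and [section] headers."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # simplistic: ';' always starts a comment
        if not line or line.startswith("[") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def opcache_settings(settings):
    """Keep only the directives in the opcache.* namespace."""
    return {k: v for k, v in settings.items() if k.startswith("opcache.")}


if __name__ == "__main__":
    for name, value in opcache_settings(parse_ini_settings(SAMPLE)).items():
        print(f"{name} = {value}")
```

To check a real installation, the same functions could be fed the contents of /etc/php.ini instead of SAMPLE.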
Posted over 8 years ago
Yioop has been crawling for over three months now and is somewhere between 300 and 400 million pages. The first month was a little slow, as each of my Mac Minis was updated to OS X Mavericks and I was still working out the kinks in the latest iteration of the crawler. I have now also switched the default index from the crawl I had been using, which was slightly over a year old, to the current active crawl. This will probably make search results a little slow for now because everything is being served off regular hard drives. I am hoping to have saved enough for SSDs large enough to hold the dictionary and posting lists by the time I stop the current crawl in a few months. All of the information learned working out the kinks in the current crawl was used in creating Version 2.0 of Yioop, which was just released today! Here is the Official Yioop Version 2.0 Release Announcement.