Forums : Feedback Forum

Dear Open Hub Users,

We’re excited to announce that we will be moving the Open Hub Forum to https://community.synopsys.com/s/black-duck-open-hub. Beginning immediately, users can head over, register, get technical help, and discuss issues pertinent to the Open Hub. Registered users can also subscribe to Open Hub announcements there.


On May 1, 2020, we will be freezing https://www.openhub.net/forums, and users will no longer be able to create new discussions. If you have any questions or concerns, please email us at [email protected].

Project updates

Hi,

I note that you've had some more minor hiccups with update queues, and had a few thoughts as to how they can potentially be addressed.

In the short term, the problem you are of course encountering is that you are continually adding more projects while also trying to update existing ones whenever possible. Each of those projects costs server resources, so updates probably can't happen in real time.

What I'd suggest is a few small things now, plus one more for the future.

  1. Allow projects to specify update intervals. For example, I have a number of projects listed on Ohloh (InspIRCd, ircc, obsidian, svnbot, etc.) - most of these receive incredibly infrequent updates, or I simply don't care about them that much, so I'd be willing to set an update interval of a month on all except the active ones. This might also be a sensible default for people to change.

Building on this, it should then be possible to decrease update checking as a project's activity decreases, and increase it again as activity picks back up - so the chosen interval would get a fuzzy modifier applied to balance things out a bit (see the sketch just after this list).

  2. Diffs
    Modify counting to only count changes on diffs, e.g. read each changeset individually via svn diff (or equivalent) and do the counting on that, applying the delta to the overall project statistics. You may already do this; I confess to not yet having read ohcount's source.

  3. Notification
    Instead of (or possibly in addition to) this, you could operate the updates on a notification basis: projects that notify you of updates (via a post-commit hook in SVN or whatever) get bumped further up the queue - or perhaps the queue could be made up entirely of these, though I'm not sure how well other VCSes support hooks like this. The advantage is that it updates the projects that most need updating; a possible disadvantage is that a really huge project keeps getting bumped, which means frequent recounts of something large.

  4. Time based
    Schedule update delays depending on how long it takes to line-count a project's source - sure, you'd be talking small amounts, but they all add up, so more frequent updates of more projects would be a win. The downside is that this is rather a superficial stopgap.
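
To make item 1 a bit more concrete, here's a rough sketch in Python of how an interval could be stretched for quiet projects and shrunk for busy ones. All the names and thresholds are invented; this is just the shape of the idea, not anything Ohloh actually runs.

    from datetime import datetime, timedelta

    class ProjectSchedule:
        # Hypothetical per-project scheduling record; field names are invented.
        def __init__(self, base_interval_days=7):
            self.interval = timedelta(days=base_interval_days)
            self.next_update = datetime.utcnow()

        def reschedule(self, commits_since_last_update):
            # Stretch the interval for quiet projects, shrink it for busy ones.
            if commits_since_last_update == 0:
                factor = 2.0      # nothing happened: back off
            elif commits_since_last_update > 50:
                factor = 0.5      # very active: check again sooner
            else:
                factor = 1.0
            # Clamp between one day and one month so nothing starves or spams.
            self.interval = max(timedelta(days=1),
                                min(self.interval * factor, timedelta(days=30)))
            self.next_update = datetime.utcnow() + self.interval
            return self.next_update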


Future:
Either allow client-based distributed counting (i.e. done by the people who use Ohloh) - probably not the most practical of ideas, and perhaps prone to over- or under-reporting - or allow projects to supply their own count data (via a callback HTTP page or something?). The latter seems more suitable, but it's again prone to people doing stupid things for a laugh, so it would perhaps need correcting with a periodic Ohloh recount (once every month or two).

--

Just some random thoughts on helping scalability. I do a lot of scalability-related stuff at work, though mine is in a slightly different arena of the web, so it was interesting to put my brain to use like this :)

Robin Burchell about 16 years ago
 

Hi w00teh,

Minor hiccups -- you're so forgiving :-).

It's interesting that you would post this now, since I've been spending the last several days focusing on the performance of our server farm and trying to speed up the average update time. I've been thinking about this a lot lately.

You're right that the biggest problem we deal with (and it's a nice problem to have) is that there are always more projects coming into the system. Practically, this means that we have a continually increasing requirement for additional throughput, whether through better algorithms or simply more hardware.

If all we had to do was keep our current projects up-to-date, I'm pretty sure we could get down to daily updates without too much stress. I've been spending the last few days optimizing this side of the equation.

The real problem is that when a new project comes in, we have to download its entire history, and sometimes this takes a long time -- for a large project, this can literally take several weeks, during which time that server can do almost nothing else. We've killed some projects after literally a month of steady download. When several big jobs like that come in at the same time, it can reduce our available horsepower for updates to almost zero.

Apart from spending hundreds of thousands of dollars on new hardware, I'm not sure what strategy to use for this. We like our new users, and we want new projects on the system to have reports as soon as possible. We always assumed it was better to get a new project online quickly than to update existing projects. However, more and more we're starting to feel that real-time updates are a strategic goal for us.

I like the idea of setting specific project update intervals. We'll probably attach a timestamp to each repository that says "Don't update me until ...". This timestamp would most likely be estimated by looking back over the recent commit history and picking either an hour, a day, a week, or a month. We could wire up a "Please update this project" button on the website to let users request a quicker update when they know there are pending changes. We could also use this timestamp to effectively turn off updates of a repository completely if the user knows that the repository has been retired or taken offline.
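
As an illustration of how that estimate could work (a Python sketch with invented names and thresholds, not our actual code), bucketing the recent commit rate into those four delays might look like this:

    from datetime import datetime, timedelta

    def next_update_after(commit_dates, now=None):
        # Pick an hour/day/week/month delay from the last 30 days of commits.
        now = now or datetime.utcnow()
        recent = [d for d in commit_dates if now - d < timedelta(days=30)]
        if len(recent) > 100:
            delay = timedelta(hours=1)
        elif len(recent) > 20:
            delay = timedelta(days=1)
        elif recent:
            delay = timedelta(weeks=1)
        else:
            delay = timedelta(days=30)   # looks dormant or retired
        return now + delay

A "Please update this project" button would then simply reset that timestamp to now.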

Regarding incremental updates, we've been doing things incrementally from the start. We couldn't survive without it. Not absolutely everything is incremental, but the really expensive parts are. If we literally updated a project every hour, yes, we might have some things to streamline, but for the most part an update costs only a tiny fraction of what a complete download costs.

I do think there's room in the future for triggered updates -- someone could install a post-commit hook on their source control server to notify Ohloh of changes and add themselves to the update queue. When we're fast enough to make this relevant, I think this would be pretty straightforward to implement.
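
The client side of that would be tiny. For example, a Subversion post-commit hook could be a small script along these lines; note that the notification URL is purely hypothetical, since we don't expose such an endpoint today:

    #!/usr/bin/env python
    # Hypothetical SVN post-commit hook: argv[1] is the repository path,
    # argv[2] is the new revision number.
    import sys
    import urllib.parse
    import urllib.request

    # Invented endpoint - Ohloh does not actually provide this (yet).
    NOTIFY_URL = "https://www.ohloh.net/projects/example/request_update"

    def main():
        repo, rev = sys.argv[1], sys.argv[2]
        data = urllib.parse.urlencode({"repo": repo, "revision": rev}).encode()
        try:
            # Fire and forget; if this fails, the normal polling schedule
            # still applies, so there's nothing to retry.
            urllib.request.urlopen(NOTIFY_URL, data, timeout=5)
        except OSError:
            pass

    if __name__ == "__main__":
        main()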

In the far future of Ohloh, our current relational-database-centric architecture may be completely scrapped and replaced with a map-reduce/CouchDB type of universe where metrics are stored in documents instead of database rows. At that point, Ohloh could become completely decentralized, and metrics could be generated anywhere with access to the source control a la SETI@home, and then the final metrics document could be submitted to Ohloh. This allows highly distributed computation, lets us get behind firewalls, and possibly avoids the need to download code to Ohloh at all. If we had it to do over again, this is the road we would follow.
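
In that model, a counting node anywhere - on a volunteer's machine, or behind a company firewall - would produce a self-contained metrics document and submit it to us. Purely for illustration (these field names are invented, not a real Ohloh schema), such a document might look like:

    import json

    # Illustrative only: invented fields, not a real Ohloh format.
    metrics_document = {
        "project": "example-project",
        "analyzed_revision": "r1234",
        "generated_by": "volunteer-node-42",
        "languages": {
            "c":    {"code": 85000, "comments": 12000, "blanks": 9000},
            "ruby": {"code": 3000,  "comments": 500,   "blanks": 400},
        },
        "contributors": [
            {"name": "example_committer", "commits": 812},
        ],
    }

    print(json.dumps(metrics_document, indent=2))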

I'm still not sure what to do about new projects hogging all our server resources away from regular updates, but we're speeding up the updates. And your idea for specifying update intervals is probably coming next.

Just my own ramblings, thanks for your thoughts,

Robin

Robin Luckey about 16 years ago
 

Your idea of a "Please update this project" button as a temporary fix would be greatly appreciated! :)

Sean Colombo about 16 years ago
 

How are you going to handle incremental updates when you update parts of the analysis software itself? For example, I'd expect that an updated ohcount with support for more languages and bug fixes for already-supported languages might require re-analysis of whole projects.

As for the distributed computing idea, I was going to suggest something like that myself. I wonder if your project could become part of BOINC?

Giuseppe "... about 16 years ago
 

I think that downloading the whole history applies to things like counting commits and attributing them to developers. ohcount should always look at the latest snapshot, and probably is not incremental.

Paolo Bonzini about 16 years ago
 

I'm still not sure what to do about new projects hogging all our server resources away from regular updates, but we're speeding up the updates. And your idea for specifying update intervals is probably coming next.

I think you should first download only the latest 10 revisions for new projects, and then fetch the rest of the history in steps of ten (or even more) as part of the regular updates.
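
Roughly like this, as a sketch in Python (all names made up):

    def plan_downloads(latest_revision, chunk=10):
        # Yield revision ranges: the newest chunk first, then older history
        # in equal-sized chunks during later regular updates.
        high = latest_revision
        first = True
        while high > 0:
            low = max(1, high - chunk + 1)
            yield ("initial" if first else "backfill", low, high)
            first = False
            high = low - 1

    # e.g. for a repository at r57: initial 48..57, then backfill 38..47, etc.
    for kind, low, high in plan_downloads(57):
        print(kind, low, high)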

Anonymous Coward about 16 years ago
 

@Giuseppe -

Yes, this is a real problem for us: when Ohcount is updated, it often requires us to recount huge amounts of existing code on Ohloh. We don't have the horsepower for this right now, so projects are inconsistent in the version of Ohcount they've been counted with. We're recounting as we can, and will recount specific projects on request.

@Paolo -

Yes, Ohcount just works on directory snapshots. The work of parsing commit histories and generating diffs is done externally from Ohcount.

@Jumpin-Banana -

That's almost possible. The only problem is that our history parser works forward in history, not backward. We built it in the forward direction so that we could incrementally update as time goes on. I'm not sure what it would take to reverse the direction; an interesting idea.

Robin Luckey about 16 years ago
 

Just FYI, we're experimenting with a change to the balance between downloading new projects and updating existing projects.

Starting last night, long-running downloads of any kind have only a 1/3 duty cycle: after a download runs for 8 hours, it stops and reschedules itself to resume in 16 hours.
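
The scheduling rule amounts to something like this (an illustrative Python sketch, not the actual job code):

    from datetime import datetime, timedelta

    MAX_RUN = timedelta(hours=8)     # work for at most 8 hours...
    COOL_OFF = timedelta(hours=16)   # ...then give the machine back for 16

    def run_with_duty_cycle(download_steps, now=datetime.utcnow):
        # download_steps is an iterator of small resumable units of work
        # (e.g. "fetch one changeset"). Returns the time to resume at, or
        # None if the download finished within this run.
        started = now()
        for step in download_steps:
            step()
            if now() - started > MAX_RUN:
                return now() + COOL_OFF
        return None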

The majority of projects on Ohloh can fully download in less than an hour, and these aren't really a problem for us. However, the really monstrous and slow source control histories will now take much longer to appear on Ohloh for the first time. This will prevent the server farm from getting bogged down by a few big slow downloads.

This has already shown its effect -- last night we updated about 4 times as many projects as the night before.

Robin Luckey about 16 years ago
 

Nice!

Anonymous Coward about 16 years ago
 

Could you give an indication of the average time between updates for a project? I'm asking because I created a few new projects in February, and they haven't seen any updates since the initial one.

Camiel Vanderho... about 16 years ago
 

We are currently running about 4 days back, and getting faster. Let me know which projects are not updating and I'll take a look.

Robin Luckey about 16 years ago
 

Nice work Robin. I would have posted earlier but I've been travelling for the past few days.

I hadn't considered the effort involved in downloading large projects, but even ignoring the bandwidth for, e.g., KDE's repository, I'd hate to imagine how long just fetching the histories takes.

I'll put some thought into that myself and see if there's anything I can come up with, but short of nicing the downloads, throttling the bandwidth use (it would be nice if you could somehow throttle per project; I don't know how feasible that is - firewalling isn't my forte), and killing (and resuming) them periodically, I don't think there's much more that can be done to help. You could perhaps consider stopping and resuming at a shorter interval, though, since projects that are taking 8 hours are obviously already having problems.
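
On the per-project throttling point, it doesn't have to happen at the firewall; a crude cap in the downloader itself would do. Something like this, as an illustrative Python sketch (nothing Ohloh-specific):

    import time

    def read_throttled(stream, max_bytes_per_sec, chunk_size=64 * 1024):
        # Yield chunks from a file-like stream, sleeping as needed so the
        # average rate stays under max_bytes_per_sec.
        start = time.monotonic()
        total = 0
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            expected = total / max_bytes_per_sec   # seconds this much *should* take
            elapsed = time.monotonic() - start
            if expected > elapsed:
                time.sleep(expected - elapsed)
            yield chunk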

What kind of contention do your servers encounter? IO locking? Memory limitations? Bandwidth?

The more I (and others :-)) know, perhaps the more advice we can offer.

Robin Burchell about 16 years ago
 

Hi Robin,

The project I'm mainly concerned about is ES40 Emulator (http://www.ohloh.net/projects/12354). The last update was Feb. 28th, and there has been a lot of activity since then.

Camiel Vanderho... about 16 years ago
 

Hi
My project (Faalis) has not been updated for more than two months.
Please take a look.

thanks

Sameer Rahmani about 9 years ago
 

Sameer,

Analyzed on Tuesday, March 17, 2015 based on code collected on Tuesday, March 17, 2015.

Thanks!

ssnow-blackduck about 9 years ago