I'm a CPAN admin, and I'm trying to reconcile the line of code counts in Ohloh against some numbers we have.
My issue concerns Perl in particular, but it applies equally to many other languages.
The CPAN is the closest thing we know of to a complete collection of all Perl code. From our analysis, we know it contains somewhere in the vicinity of 20,000,000 source lines of code (SLOC).
There are a couple of major Perl projects run outside the scope of the CPAN, but between them they probably only add around 5,000,000 lines of code.
Ohloh can't see into the CPAN itself, but it does scan many of the separate, scattered SVN/Git repositories in which authors host their code.
So assuming Ohloh had perfect coverage of all CPAN feeder repositories, we should expect the amount of Perl counted by Ohloh to be in the vicinity of 25,000,000 lines (plus maybe 50% extra to cover Perl scattered in various smaller projects).
The actual Ohloh Perl SLOC count is in the vicinity of 70,000,000 SLOC.
That's roughly two and a half times EXTRA code on top of the existing 18,000 Perl packages that we would expect Ohloh to find if it had theoretically perfect project coverage.
Suffice it to say that we would be extremely interested in finding out where this code is.
Ohloh makes this somewhat difficult, because the number of lines of code in each project is buried in a detail page.
It would be extremely beneficial if there was a way to get a list of the projects for each language, sorted by the number of lines of code in that language.
That way we can make a start on finding out where the missing 50,000,000 lines of code are.
I think part of this may be due to the way Ohloh detects what is code and what is comments. From what I can tell, it seems that it treats POD comments as code, but I'm not sure, and I just joined Ohloh today :-)
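If POD is being counted as code, that alone could inflate the totals substantially, since a counter has to track POD blocks as state rather than line by line. Here's a rough sketch of that heuristic in Python, purely for illustration (this is my guess at the rule, not Ohloh's actual algorithm; it ignores heredocs, `__END__`/`__DATA__` sections, and strings that merely look like comments):

```python
import re

def count_sloc(lines):
    """Classify Perl source lines as code, POD, comment, or blank.

    Simplified POD rule: a block starts at a line beginning with
    "=word" at column 0 and runs through the "=cut" line, inclusive.
    """
    counts = {"code": 0, "pod": 0, "comment": 0, "blank": 0}
    in_pod = False
    for line in lines:
        stripped = line.strip()
        if in_pod:
            counts["pod"] += 1
            if re.match(r"=cut\b", line):  # =cut closes the POD block
                in_pod = False
        elif re.match(r"=[A-Za-z]\w*", line):  # =pod, =head1, ... open one
            counts["pod"] += 1
            in_pod = True
        elif not stripped:
            counts["blank"] += 1
        elif stripped.startswith("#"):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts

sample = """\
use strict;

=head1 NAME

Foo - an example module

=cut

# a plain comment
my $x = 1;
"""

print(count_sloc(sample.splitlines()))
# → {'code': 2, 'pod': 5, 'comment': 1, 'blank': 2}
```

A counter that skips the `in_pod` state entirely would report those five POD lines as code, which is exactly the kind of inflation I'm wondering about.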
Here is a list of projects using Perl: https://www.ohloh.net/p?q=language%3Aperl (I'm not sure of the sorting order). The extra lines may just be some projects duplicating some Perl module in their own repository for their own use. Also, Ohloh's line counts can be wrong on some projects; there are even some line counts that end up being negative. So don't trust them too much :)
I haven't had time to drill into this deeply yet, but some poking around in our data does shed some light on the "missing" lines of code.
We have about 22 million lines of Perl from SourceForge and about 10 million lines from Google Code. That's not counting comments, POD, or blank lines: just actual code. From just these two forges alone, we've basically accounted for your missing code.
Considering SourceForge only, about 1/3 of this Perl code, or about 7 million lines, comes from projects that are primarily written in Perl. That is, if you asked a developer, "what language did you use to write your program?", they would answer "Perl". This is only about 200 total projects.
This means that about 2/3 of the Perl code we found on SourceForge served a supporting role: build scripts, automation, utilities, and libraries. If this pattern holds true across projects everywhere, then about 2/3 of all the Perl in the world is "dark matter" -- it's hidden in the lib directories of thousands of projects everywhere, and it greatly exceeds the volume of "primary" Perl.
I can't guess what fraction of CPAN is covered by Ohloh. We haven't made an organized effort yet to crawl CPAN in the way that we have covered some other forges, so I believe our coverage is very limited.
I agree that it would be nice to see some reports showing which projects are using which languages. I would also like to be able to slice this report by forge, by year, etc. I'm optimistic that we'll have time to implement these kinds of features soon. In the short term, we're sort of constrained by the fact that our data model is highly optimized for the kind of reports you already see on Ohloh, so other types of queries are somewhat difficult.
This is a pretty interesting question. I'm happy to try to run some queries against our data if you have some specific interests.
That's neat to see. The problem with mining data from CPAN is that the code is not in centralized Subversion repositories like SourceForge. It can be anywhere.
So it might be difficult to cull results from CPAN into something useful.
And worse, some packages might not even use a Version Control System :o
Still, some pretty neat statistics to see.
What I personally wonder is: how many of the Perl projects listed on SourceForge are duplicates of those on CPAN? Would CPAN benefit from a single central repository provided to Perl authors in the same way that PAUSE itself is?
PulkoMandy: The top project in that list is primarily written in C. So that's of no real help :)
Robin Luckey: If I had to guess your CPAN coverage, I have a few clues. My repository alone covers about 2% of the CPAN (about 300 packages), and I suspect there's on the order of 10-20 times that amount floating around in other repositories. So my educated guess is that you cover about a third of the CPAN.
jawnsy: Perl already has the svn.perl.org repository, which people are able to use if they wish.
But it seems that authors want a lot more control and customisation of their repository.
For example, my repository (which is one of the bigger ones) works efficiently because I've been able to tailor the hell out of it. Using the svn.perl.org repository wouldn't be useful to me.