[ not to promote flame wars ]
I sometimes feel it is becoming more and more necessary to split the stats (or add two stat types to the current aggregate-all language stats) into two broad categories:
1) Programming and Scripting Languages Alike
Including, but not limited to, the classical imperative and functional languages, such as (in arbitrary order, not in order of preference :)): C/C++ dialects, Java, Perl, Ruby, PHP, Python, Pascal/Delphi, Fortran, Erlang, OCaml, Lisp/Scheme dialects, ML, and all their hybrids. And possibly all sorts of shell scripting languages and all Basic variants. (It is understood that this list, like any other below, is not a complete list of the languages available in general, or of those present on ohloh today.)
2) Data Representation, Modeling, Presentation, and Documentation Languages
Ohloh could add a website feature for voting on which category each gray-area language belongs in. :-) Of course, people may then suggest creating more categories of languages, but I believe two, at most three, is what is useful, manageable, and necessary.
Just a suggestion; it came up because I have seen a number of projects where I'd want to see such a separation. Surely, there will be more to come.
Thanks for listening this far :-)
Sorry it's taken so long to reply, but you raised a lot of good questions and I felt you deserved a thoughtful response.
Internally, we actually do split languages into two broad categories: "markup" and "procedural". Currently, we consider XML, HTML, and CSS as markup languages, and everything else is procedural. We currently only recognize about two dozen languages, so this list may grow.
We originally did this because we wanted to identify the "primary language" of a project, and we discovered that a lot of projects were coming back with HTML or XML as their primary language, which was clearly wrong in most cases. Typically, this happened when a project included a lot of web content or documentation. Now we ignore markup when determining the primary language [with some limits: a project that's 99% XML and 1% shell script will still come back as an XML project]. Ironically, the primary language doesn't even get exposed on the current website. However, we can envision a day when you might do a search for projects written only in a particular language.
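To make the rule above concrete, here is a rough sketch of how "ignore markup, with some limits" might look. This is not our actual code: the function name, the line-count input, and especially the 0.95 cutoff are assumptions made up for illustration. The only things taken from this post are the markup list (XML, HTML, CSS) and the 99% XML example.

```python
# Hypothetical sketch, not Ohloh's implementation.
MARKUP = {"xml", "html", "css"}  # the three markup languages named above
MARKUP_CUTOFF = 0.95             # assumed limit: markup wins only above this share


def primary_language(line_counts):
    """Pick a primary language from a {language: line_count} mapping."""
    total = sum(line_counts.values())
    if total == 0:
        return None  # nothing to classify

    markup = {k: v for k, v in line_counts.items() if k.lower() in MARKUP}
    procedural = {k: v for k, v in line_counts.items() if k.lower() not in MARKUP}
    markup_share = sum(markup.values()) / total

    # Ignore markup unless it dominates the project or nothing else exists.
    if procedural and markup_share < MARKUP_CUTOFF:
        pool = procedural
    else:
        pool = markup
    return max(pool, key=pool.get)
```

With these assumed numbers, a project that is 70% HTML and 30% C++ comes back as C++, while the 99% XML and 1% shell script project still comes back as XML.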
A surprising number of Ohloh users don't like the fact that our reports reveal a lot of HTML or XML in their projects. I'm not sure where this resistance comes from, but it feels like the users would like us to report markup line counts separately from "real" languages somehow. Personally, I think that in the future we're going to de-emphasize absolute line counts as a metric in favor of frequency of checkins and relative complexity of those checkins.
I'm a bit puzzled about what to do in the case of a single developer who writes both a lot of HTML and a lot of C++. Is it more correct to label this person as a C++ developer or an HTML author? With equal output in each language, I'm sensing from you that we should call this person a C++ developer. If this person had written no C++ at all, then "HTML author" is the correct label. Where, then, is the boundary between the two? How much C++ can a person write before we discredit the HTML?
I think that the right answer is that we shouldn't try to pick just one label for this industrious person. If we want to give a one-line summary of this person's talents, I think we have to list both the C++ and the HTML, and the fact that one language is procedural and another is markup doesn't even enter the picture.
This is a pretty interesting topic for me, so I'd love to hear any more thoughts you have. Over the next few months we're probably going to be developing some deeper analyses for individual developers, so this is a good time to brainstorm.