One of the few things that really bothers me about this system is that, because I make commentary about being personally against the GPL, the code scanner believes my code is GPL - and all I ever do is say that "[I] feel strongly against the GPL" and link to a site I haven't set up yet, http://whyihatethegpl.com/ .
Details are here: http://labs.ohloh.net/ohcount/ticket/292
It seems like there are probably stronger phrases to match against. I don't want my code flagged as containing GPL contamination; it does not contain any GPL code, and between my personal bias against that license and the fact that it would make my real choice of license illegal, it's fairly important to me to get this fixed.
Here's one place the erroneous assumption shows up (there may be others):
Any chance of a fix? I really, really don't want to be seen as making GPL code; I wouldn't own that domain if I weren't so rabid.
mforal looked into this today. The problem is a naive string match in the license checker (just as predicted) which marks anything containing the three letters "gpl" as a GPL file.
The offending regex:
Guys, it's been half a year since this forum post, and nearly 14 months since I first reported this on IRC. Given that it's just a question of changing a string to a stricter string, don't you think it's time this got fixed?
It's a real problem that you're mis-measuring license usage this way, and accusing projects that are meant to be available to commercial users of containing a viral license that isn't actually in place. I've been contacted now seven times by people asking me to remove the GPL so they can use my library, and Ohloh is the only thing on the entire internet suggesting that I'm GPL. Lord knows how many users I lost who just kept looking.
PLEASE FIX THIS. IT'S TRIVIALLY EASY TO FIX, AND IT'S BEEN MORE THAN A YEAR.
All you have to do is change "gpl" to "general public license" and the whole problem disappears. >:(
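A minimal sketch of the two strategies being argued about, assuming the detector behaves as described in this thread (a case-insensitive search for "gpl"). Both patterns and the `flags_gpl` helper are hypothetical stand-ins, not Ohcount's actual code:

```python
import re

# Hypothetical stand-ins for the two strategies; neither is copied from Ohcount.
NAIVE = re.compile(r"gpl", re.IGNORECASE)
STRICT = re.compile(r"general\s+public\s+license", re.IGNORECASE)


def flags_gpl(text, pattern):
    """Return True if the pattern appears anywhere in the text."""
    return pattern.search(text) is not None


# A comment that only voices an opinion about the license.
opinion = "# I feel strongly against the GPL: http://whyihatethegpl.com/"
# A terse header that really does declare the license.
terse = "/* Licensed under the GPL v2 */"
# A header that spells the license name out.
full = "/* Released under the GNU General Public License */"
```

With these samples the naive pattern flags both `opinion` and `terse`, while the stricter phrase flags only `full`. So the stricter phrase clears the opinion comment, but it also skips the abbreviation-only header, and that second case is the one any replacement pattern would need to account for.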
All caps isn't going to deliver what you want, but it has at least earned you a response.
The fix isn't as trivial as you describe.
There's a great deal of source code that uses sloppy language in its licensing text, and uses merely the abbreviation "GPL" to describe itself. If we change our parser as you describe, there will be literally billions of lines of GPL code that go undetected. That would be a major problem for Ohloh.
As far as I'm aware, yours is the only code on Ohloh that receives a false positive for GPL.
What you are really asking for is a special exception in our code that recognizes your anti-GPL essay. Given the number of things we can be working on to improve Ohloh for everyone, it's unfair to make such personal demands.
However, the code is open, and patches speak louder than caps lock. I'm not a fan of applying patches to our code to accommodate a single developer's quirks, but would do so in this case to bring things to a peaceful resolution.
It seems remarkably unlikely that the amount of GPL code out there which cannot be identified by any means other than those three letters is anywhere near the size of the amount of code being misidentified. I'm aware of three projects being misidentified.
What would you have me write as a patch? How am I to patch this without breaking a mystery metric?
I am glad, though, that I finally got a response at least. Took more than a year.
I mean I've already written four patches for you guys, and this is something that I don't have the knowledge to navigate, namely expectation driven decisions regarding how to fix this.
I literally cannot make this patch because I have no idea what random beliefs someone might hold about marking up source which may or may not be appropriately marked in the first place.
Can you give me an example of a project which is GPL but which cannot be identified as such except to put that license on literally everything containing that three letter string?
I'm losing users by leaving my project on your system with this defect in place. Please help me remove the extremely simple logical defect which causes this mis-labelling.
There are any number of ways for a person to fix this, but they all require staff permission and staff judgement calls. No patch I could donate is guaranteed to navigate a mystery belief system, and I'm certain you won't allow me to alter the database schema to allow users to manually specify that they don't use a license that doesn't actually appear anywhere in their product.
So either I mute my opinion in every single file in my project, which is what I've been doing for more than a year while waiting for this to be fixed, or I accept that your site declares me as breaking the law, and tells everyone my project is meant for that it isn't actually for them.
I'm losing users because you won't fix an obvious license labelling failure.
Please help me understand why. My experience with source code suggests that this is almost certainly not the best way to do this.
Can you show me just one project which requires this extremely poor match?
It's not like I'm asking a lot.
"[I] feel strongly against the General Public License".. problem solved?
The inability to post my domain is the most serious problem. However, there's also the fundamental problem of wanting correctness: I'm aware of three different projects being mis-labelled as GPL already, my own included, and many groups use OhLoh to attempt to measure the prevalence of GPL.
The measurement being afforded to GPL is known to be incorrect, and should be repaired.
I also very much do not want to remove my own domains from every single source file in my library anymore.
I do appreciate that you've been helpful and made patches in the past. I hate to say no, and I'm sorry that this results in what seems a raw deal for you. But there's only so many things we can do in a day, and we have to prioritize.
In some ways, you make my argument for me. I can't predict how code changes will affect a billion lines of parsing any more than you can. However, I know that our current solution works extremely well for the majority of cases -- except, unfortunately, yours.
If you want to create a patch that specifically rejects your anti-GPL header, I'll take it. There's no staff permission or "mystery belief system" that needs negotiating there.
But generalized changes to our GPL detection are not going to be a priority for us unless there is a demonstrable, systemic problem that affects a lot of projects.
"As far as I'm aware, yours is the only code on Ohloh that receives a false positive for GPL."
That is not true. Apache Camel is another project that gets false positives. And I am willing to bet there are others. Personally I think a best effort to fix this should be attempted. Would I be correct in assuming that you wouldn't want Ohloh to be represented by a 3rd party as something it's not?
How would you define a "demonstrable, systemic problem that affects a lot of projects"?
For what it's worth, ASF has a fairly strict licensing policy.
In general, inferring that a file like this one that has an unambiguous licensing statement at the top, is licensed under GPL just because the 'GPL' string happens to be found somewhere in a comment is, ahem, ridiculous.
A metric is only valuable if it's reasonably accurate.
I quickly checked and the following Apache projects are also reported to contain GPL code: ServiceMix, Hadoop, Felix and there may be others. Do you want me to do a thorough research?
"But there's only so many things we can do in a day, and we have to prioritize."
Respectfully, a string replacement is a 20 second patch for any competent engineer, and Jason promised me he'd fix this more than eight months ago.
This causes legal difficulties for us, it's trivially easy to fix, and there isn't yet known a project that the more sensible stricter matching mechanism wouldn't catch.
I'd be willing to do it myself, if you'd just point out one of the projects you feel makes this kind of naive matching necessary. I just can't fix a problem I can't see, and you're the only person I've ever seen claim that there's a good reason to do it this way.
Please. All I'm asking for is one half of one minute of your time. I've given you hours of mine. Give us a favor. I'm not alone; when I catch either of the two other project maintainers I know about and ask their permission, I'll tell you two other projects I know about suffering this.
You now know of two Apache projects suffering this problem. That should help you understand that this is neither a rare case nor an edge case, and that your extremely liberal pattern match raises significant legal difficulties for some of us.
It would be so easy to fix. I'd buy you a pizza to save you a hundred times the time the fix would take just in ordering the pizza.
I even tracked down the regular expression for you. (And by I, I mean mforal actually did it. But it sounds more compelling if I pull an RMS and pretend I did the work.)
It's a super easy fix as long as someone has access to knowledge of the expectations you won't share or the mysterious projects which require this.
Scratch that, on closer read, five known Apache projects are mis-labelled, and there are potentially others.
This is actually a widespread problem, Mr(s) Luckey. And a serious one to boot, both for the projects affected and for people who count on OhLoh statistics to accurately reflect license usage in the real world.
Several times I've seen arguments based on the high prevalence of GPL code that Ohloh showed.
It becomes clear that this long known, trivially fixable defect has led to some quite serious policy errors on the part of people who rely on your data gathering.
Please spare us one minute of your time.
Mr. Zbarcea: thank you for speaking up. A lone voice is often hard to hear.
It's my expectation that once this defect is resolved, GPL usage will see an immediate precipitous drop. I fully expect to see a minimum of 5% usage loss, and would not at all be surprised to see 15% or more.
I understand Robin's point of view, and would agree that the fix is not that trivial. Personally I'd like to see a well understood and documented way of determining the licensing of a particular artifact. Searching for 'general public license' instead of 'gpl' is (almost) equally random in my mind, and may lead to false negatives instead of false positives.
The question is what is a commonly accepted way of determining the licensing of an artifact? Look for 'GPL' or other licensing indicators only in comments near the beginning of the file, before code? Not sure what the right solution is here.
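One way to picture the "only in comments near the beginning of the file" idea is the sketch below. It is an illustration under stated assumptions, not a proposal for Ohcount's internals: the comment markers, the `LICENSE_HINT` pattern, and both function names are made up for the example, and treating any line starting with `#` as a comment is a deliberate simplification.

```python
import re

# Hypothetical pattern for license indicators; not taken from Ohcount.
LICENSE_HINT = re.compile(r"\bL?GPL\b|general\s+public\s+license", re.IGNORECASE)


def leading_comment(source, markers=("#", "//", "/*", "*", "*/")):
    """Collect the comment lines at the top of a file, stopping at the first code line."""
    header = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not end the header
        if stripped.startswith(markers):
            header.append(stripped)
        else:
            break  # first non-comment line: the header is over
    return "\n".join(header)


def header_mentions_gpl(source):
    """Return True only if a license indicator appears in the leading comment block."""
    return LICENSE_HINT.search(leading_comment(source)) is not None
```

This would ignore a stray "LGPL" in a comment further down a file whose header states another license. It would not help when the mention sits in the header itself, as with an opinion line placed at the top of every file.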
I also fail to see the legal difficulties, as it's not that obvious how ohloh metrics could be grounds for a lawsuit. I see however ohloh's reputation being at stake and therefore think that this issue should be addressed. And I am willing to pitch in with a pizza too.
If Ohloh were to provide an ohloh specific comment flag that overrode all other things detected for a file, a person aware of the problem could fix it.
It's not an ideal response, but it allows you to retain current behavior without hurting the several well known large projects which are being mis-labelled this way.
I contend that without counter-examples, the sensible solution is still to write a stricter match, and that with counter-examples, the sensible solution is ... still to write a stricter match.
However, this would allow those of us suffering the current defect to at least repair it in our projects.
It isn't ideal, but it's workable.
Can you make that happen, at least?
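To make the request concrete, here is a rough sketch of how such an override flag could work. The marker syntax `ohloh-license:` is invented for this example, and `detect_licenses` is a hypothetical stand-in for the detector, with the naive "gpl" scan kept as the fallback so current behavior is retained:

```python
import re

# Hypothetical marker syntax; no such flag is defined by Ohloh in this thread.
OVERRIDE = re.compile(r"ohloh-license:\s*([\w.+-]+)", re.IGNORECASE)
# Stand-in for the current behavior as described in this thread.
NAIVE_GPL = re.compile(r"gpl", re.IGNORECASE)


def detect_licenses(text):
    """Return the declared license if an override marker exists, else use the naive scan."""
    declared = OVERRIDE.search(text)
    if declared:
        return [declared.group(1)]  # the explicit declaration wins over everything else
    return ["gpl"] if NAIVE_GPL.search(text) else []
```

A file containing `// ohloh-license: MIT` followed by an anti-GPL remark would then report only `MIT`, while files without the marker are scanned exactly as before.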
Ah crap, Mr. Zbarcea beat me to it.
Hadrian: you're well spoken and much more pleasant than I am. I'm glad you're here to pave over my personality. ;)
Mr. Zbarcea: it's important in many organizations to do due diligence in eliminating misrepresentations of project code. Even the suggestion of GPL virality, when well known by company officers, is a significant issue for companies which are sensitive to the possibility that they would be required to turn over their proprietary work.
As such, either they spend for a lawyer to examine software to verify that the false label Ohloh presents is indeed false, satisfying their legally required due diligence - always a cheap and rapid process - or they say "well, I guess I'll find a replacement."
The reason that things like Cisco, Verizon and Monsoon's GPL violations happen is because of companies which assume things are okay. It's expensive to prove it to legal requirements.
My free software is very expensive to corporate users thanks to this defect in Ohloh.
I found actual GPL code in many of these projects.
Here we found GPL in the following locations:
/src/compiler/cil/ocamlutil/intmap.ml*
/src/compiler/cil/src/ext/pta/setp.ml*
/src/judy/JudySL/JudySL.c
/src/tre/*.c, *.h (many files)
Many files in this directory:
It appears that the Hadoop code contains GPL in ltmain.sh and related files, similar to ActiveMQ. However, I'm having trouble verifying -- it looks like the Hadoop SVN URL has changed recently. Do you know the new URL?
Here we did falsely trigger LGPL. Ironically, our detector triggers because of a comment in the root pom.xml file, which specifically excludes a portion of the xtream lib precisely because it is LGPL. It's conceivable (but a long shot) that this could be fixed by ignoring any comment portions of a pom.xml file that occur within an exclude element. However, I have to imagine this is the only such case on Ohloh.
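As a rough illustration of the pom.xml idea, the sketch below drops XML comments that sit inside a given element before any license scan runs. The element name `exclusion` is an assumption (the post only says "exclude element"), `strip_comments_in` is a hypothetical helper, and a regex is used purely for brevity; a real fix would more likely use an XML parser.

```python
import re

# Matches one XML comment, including multi-line ones.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments_in(xml_text, tag="exclusion"):
    """Remove XML comments inside <tag>...</tag> blocks, leaving other comments alone."""
    # \b keeps <exclusion> from also matching the enclosing <exclusions> element.
    block = re.compile(r"<{0}\b[^>]*>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return block.sub(lambda m: COMMENT.sub("", m.group(0)), xml_text)


pom = (
    "<dependency><!-- note outside -->"
    "<exclusions><exclusion><!-- excluded: LGPL -->"
    "<artifactId>x</artifactId></exclusion></exclusions></dependency>"
)
cleaned = strip_comments_in(pom)
```

Here `cleaned` keeps the comment outside the exclusion block and loses the one inside it, so a later scan would no longer see "LGPL" in that spot.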
The Camel example is interesting. It is in general difficult to distinguish between comments that invoke a license and comments that merely mention a license. In this way, Camel is like the Ohcount library itself, which tests positive for many licenses simply because it contains code that deals with licenses. I have to admit I don't see any easy fix for the case in Camel.
Frustratingly, the problems with Camel, Ohcount, and John's code are all the same: The Ohcount software can't actually read and understand a license agreement. That's the core of this whole rambling thread.
Our software is not a lawyer. Flawless license detection is not our goal.
What we can do is say, "Hey, there's something in this file that needs looking at," and that turns out to be very helpful. We've turned up hidden GPL in a lot of projects, and false positives are quite rare.
There are always going to be projects with a few stray detections. As time goes on, we find patterns in these, and we get better at ruling them out. However, it's an inefficient use of our time to dwell on individual cases unless they are symptomatic of a lot of similar cases. That's simply not true here.
Nobody's asking for flawless. "Good" would be sufficient.
These are not "a few stray detections", Mr(s) Luckey; these are now a significant fraction of Apache projects. I'm aware, as I continue to remind you, of several other false positives.
I would love to know the names of some of the projects you refer to which make this extremely broad matching strategy desirable. To date, only false positives are known. No stray positives are known, suggesting either that there is a significant paucity of information or that this is a mistaken behavior.
"We've turned up hidden GPL in a lot of projects"
"and false positives are quite rare."
Well, first I was the only one. Then someone turned up a bunch of others on their very first look; now they're "quite rare".
Are you basing this solely on the expectation that because you've never looked for them and aren't aware of others, there must not be any? Do you have any actual reason to believe they're rare?
From here, they appear to actually be quite common.
Both of us suggested a mechanism for fixing this unequivocally which would be trivial for Ohloh to implement. Have you come to an opinion yet on that mechanism, sir?
First off, thanks for your prompt and detailed reply. You guys are doing a great job, many thanks for that. And please accept that my critique is only intended to make a great tool even better.
In another thread you state that "The license detector is not incredibly smart". So for Camel the report is not good, and for ServiceMix it's the same thing: explicit exclusion of a dependency because of its LGPL license caused ohloh to interpret it as depending on an LGPL project, which is exactly the same situation as John's, if my understanding is correct. For ActiveMQ the sandbox does not get into the distro afaik, so it should not be enlisted; I'll follow up on that, thanks.
"Flawless license detection is not our goal." That's fine, but then I suggest putting a large disclaimer on the https://www.ohloh.net/p//analyses/latest page, linking to a page that describes how license determination is done and that it could be flawed, maybe with some examples. Or, if you know that the metric is flawed, don't publish it.
However, I would much prefer improving it because the reality is that people do use your metrics, they like simple things and don't have time to research what a metric really means. I like your idea of ignoring everything (including comments) in a pom. Making changes in the code to accommodate a moving target like the ohloh licensing algorithm is imho not an option.
One more thing, I tried to figure out how you found out what the offending files are. I didn't see a way to get to them from the ohloh site, so I assume you looked at logs or something. Is it possible, for instance, to link to the offending files and have some solution a la wikipedia in which trusted editors make a determination whether the algorithm was correct? i.e. if the algorithm determines that a file is gpl licensed, I as the manager of the project say no, and then humans cast their decision? If, say, N people agree, the algorithm's decision is overruled. There should not be many files that require human intervention, and if it's done by a community it should be feasible. Just a thought.
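The voting rule could look something like this minimal sketch. Everything in it is hypothetical: the `final_verdict` name, the boolean vote encoding, and the default threshold of 3 standing in for N.

```python
from collections import Counter


def final_verdict(algorithm_says_gpl, votes, threshold=3):
    """Overrule the detector only when at least `threshold` reviewers disagree with it.

    `votes` is a list of booleans, where True means "this file is GPL".
    """
    tally = Counter(votes)
    dissent = tally[not algorithm_says_gpl]  # reviewers who contradict the detector
    if dissent >= threshold:
        return not algorithm_says_gpl
    return algorithm_says_gpl
```

With the default threshold, three "not GPL" votes flip a GPL detection, while two leave the algorithm's answer standing.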