
News

Posted over 6 years ago by Michael Henretty
Since the launch of Common Voice, we have collected hundreds of thousands of voice samples through our website and iOS app. Today, we are releasing a first version of that voice collection into the public domain.

From our beginning, Mozilla has relied on the creativity, compassion, and resourcefulness of people all over the world to help us build and promote the web as a global public resource accessible to all. This has been the foundation of our experimental work in the field of machine learning and speech recognition, and in building a large, high-quality voice data resource with Common Voice.

This collection contains nearly 400,000 recordings from 20,000 different people, resulting in around 500 hours of speech. It is already the second largest publicly available voice dataset that we know about, and people around the world are adding and validating new samples all the time! You can go download the data right now at the Common Voice download page.

Having ourselves experienced how difficult it can be to find publicly available data for our speech technology work, we also provide links on the site to all the other large voice collections we know about. And we are eager to continue growing the website as a central hub for voice data.

When we look at today's voice ecosystem, we see many developers, makers, startups, and researchers who want to experiment with and build voice-enabled technologies. But most of us only have access to fairly limited collections of voice data, an essential component for creating high-quality speech recognition engines. Such data can cost upwards of tens of thousands of dollars and is still insufficient in scale for creating speech recognition at the level people expect. By providing this new public dataset, we want to help overcome these barriers and make it easier to create new and better speech recognition systems (like our own Deep Speech). We've started with English, but we will soon support every language. With our parallel work on an open source speech-to-text engine, we hope to open up speech technology so that more people can get involved, innovate, and compete with the larger players.

Are you interested in learning about our open-source speech recognition project "Deep Speech" and how Common Voice data can be used to create better speech recognition products? Reuben Morais from Mozilla's Machine Learning team just published an article about their "Journey to <10% Word Error Rate". It provides a compelling summary of the challenges and lessons learned while working towards the team's first open-source speech recognition engine model, which has been released today on their GitHub repository!

We continue to welcome collaborators on Common Voice. Please reach out with any ideas about how we can work together, to let us know how you are using the data, or to give us feedback on how this project could be more useful.

We'd like to thank Mycroft, SNIPS, Bangor University, LibriSpeech, VoxForge, TED-LIUM, Tatoeba.org, Mythic, SAP, and of course all our contributors on GitHub. We couldn't have made this progress without you!

We are also constantly aiming to improve the quality of our dataset. Head on over to the Common Voice website now and help us verify the recordings, which is just as important as donating your voice.

Sharing Our Common Voice — Mozilla Releases Second Largest Public Voice Data Set was originally published in Mozilla Open Innovation on Medium, where people are continuing the conversation by highlighting and responding to this story.
Posted over 6 years ago by Air Mozilla
mconley livehacks on real Firefox bugs while thinking aloud.
Posted over 6 years ago by Air Mozilla
This is the SUMO weekly call
Posted over 6 years ago by Heather West
Imagine that you're surfing the web, and someone sends you a link to a clever tweet – so you click on it, only to see a message from your ISP: "We're sorry, you don't have the Social Browsing Package. Would you like to add Social Browsing for $10 a month? It offers access to Facebook, Twitter, Instagram, …" And later, your ISP prompts you to subscribe to the video service they own instead of Netflix to save money on data. This hasn't happened in the US, but it could if ISPs don't have clear rules of the road that protect net neutrality and the open web.

Last week, the FCC made the disappointing – but unsurprising – decision to schedule a vote in December to overturn net neutrality protections. We are on the heels of Cyber Monday: the most trafficked day for online commerce, and a day that clearly demonstrates the value of an open, neutral internet (just imagine having to pay your ISP a separate fee to use Jet instead of Amazon – that's bad for everyone!). But this FCC doesn't seem to care about the public interest, and instead seems hell-bent on rolling back our established net neutrality protections. As we said last week, "internet traffic must be treated equally, without discrimination against content or type of traffic." We urge the FCC to take its vote to kill net neutrality off the agenda, but our hopes aren't high. And without these rules, everyday users and small businesses will pay the price, free speech will suffer, and competition and innovation will be eroded.

That's why we're releasing our net neutrality framework today, outlining what it really takes to have net neutrality. In our framework, we offer some guidance on what is needed to protect the future of the internet and the economic and social benefits it offers. And we look forward to working with legislators and litigators alike. We encourage everyone to read the proposed solutions and get involved. Tell the FCC and Congress to protect your access to the internet with net neutrality.

We need to remember that the stakes are high, and think about how to respond to the chaos the FCC is creating by rolling back net neutrality protections. We simply cannot let Internet Service Providers win by creating fast lanes, slowing down, or even blocking traffic for innovators and businesses that want a fair shot – and their users deserve better too.

The post What does it take to get net neutrality? appeared first on Open Policy & Advocacy.
Posted over 6 years ago by Sean White
With the holiday gift-giving season upon us, many people are about to experience the ease and power of new speech-enabled devices. Technical advancements have fueled the growth of speech interfaces through the availability of machine learning tools, resulting in more Internet-connected products that can listen and respond to us than ever before.

At Mozilla we're excited about the potential of speech recognition. We believe this technology can and will enable a wave of innovative products and services, and that it should be available to everyone. And yet, while this technology is still maturing, we're seeing significant barriers to innovation that can put people first. These challenges inspired us to launch Project DeepSpeech and Project Common Voice. Today, we have reached two important milestones in these projects for the speech recognition work of our Machine Learning Group at Mozilla.

I'm excited to announce the initial release of Mozilla's open source speech recognition model, which has an accuracy approaching what humans can perceive when listening to the same recordings. We are also releasing the world's second largest publicly available voice dataset, which was contributed to by nearly 20,000 people globally.

An open source speech-to-text engine approaching user-expected performance

There are only a few commercial-quality speech recognition services available, dominated by a small number of large companies. This reduces user choice and available features for startups, researchers, or even larger companies that want to speech-enable their products and services. This is why we started DeepSpeech as an open source project.

Together with a community of like-minded developers, companies, and researchers, we have applied sophisticated machine learning techniques and a variety of innovations to build a speech-to-text engine that has a word error rate of just 6.5% on LibriSpeech's test-clean dataset. In our initial release today, we have included pre-built packages for Python, NodeJS and a command-line binary that developers can use right away to experiment with speech recognition (a brief usage sketch appears at the end of this post).

Building the world's most diverse publicly available voice dataset, optimized for training voice technologies

One reason so few services are commercially available is a lack of data. Startups, researchers, or anyone else who wants to build voice-enabled technologies needs high-quality, transcribed voice data on which to train machine learning algorithms. Right now, they can only access fairly limited data sets.

To address this barrier, we launched Project Common Voice this past July. Our aim is to make it easy for people to donate their voices to a publicly available database, and in doing so build a voice dataset that everyone can use to train new voice-enabled applications. Today, we've released the first tranche of donated voices: nearly 400,000 recordings, representing 500 hours of speech. Anyone can download this data.

What's most important for me is that our work represents the world around us. We've seen contributions from more than 20,000 people, reflecting a diversity of voices globally. Too often, existing speech recognition services can't understand people with different accents, and many are better at understanding men than women — this is a result of biases within the data on which they are trained. Our hope is that the number of speakers and their different backgrounds and accents will create a globally representative dataset, resulting in more inclusive technologies.
To this end, while we've started with English, we are working hard to ensure that Common Voice will support voice donations in multiple languages beginning in the first half of 2018. Finally, as we have experienced the challenge of finding publicly available voice datasets, alongside the Common Voice data we have also compiled links to download all the other large voice collections we know about.

Our open development approach

We at Mozilla believe technology should be open and accessible to all, and that includes voice. Our approach to developing this technology is open by design, and we very much welcome more collaborators and contributors who we can work alongside. As the web expands beyond the 2D page into the myriad ways we connect to the Internet through new means like VR, AR, speech, and languages, we'll continue our mission to ensure the Internet is a global public resource, open and accessible to all.

The post Announcing the Initial Release of Mozilla's Open Source Speech Recognition Model and Voice Dataset appeared first on The Mozilla Blog.
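The post above mentions pre-built Python and NodeJS packages and a command-line binary. As a rough illustration, here is what transcribing a short WAV file with the Python package could look like. Treat it as a hedged sketch: the module path, constructor arguments, and hyperparameter values shown are assumptions modeled on the earliest releases and may not match the package you install, so the project's README is the authoritative reference.

```python
# Hedged sketch of transcribing a 16 kHz mono WAV file with the DeepSpeech
# Python package. The constructor signature and the values below are
# assumptions based on the initial releases; check the README of the
# version you actually install.
import numpy as np
import scipy.io.wavfile as wav
from deepspeech.model import Model  # module path may differ between releases

N_FEATURES = 26   # MFCC features per time step (assumed default)
N_CONTEXT = 9     # context frames on each side (assumed default)
BEAM_WIDTH = 500  # beam width for the CTC decoder (assumed default)

ds = Model("models/output_graph.pb", N_FEATURES, N_CONTEXT,
           "models/alphabet.txt", BEAM_WIDTH)

fs, audio = wav.read("hello.wav")          # expects 16-bit, 16 kHz, mono audio
print(ds.stt(audio.astype(np.int16), fs))  # prints the transcription
```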
Posted over 6 years ago by Reuben Morais
At Mozilla, we believe speech interfaces will be a big part of how people interact with their devices in the future. Today we are excited to announce the initial release of our open source speech recognition model so that anyone can develop compelling speech experiences.

The Machine Learning team at Mozilla Research has been working on an open source Automatic Speech Recognition engine modeled after the Deep Speech papers (1, 2) published by Baidu. One of the major goals from the beginning was to achieve a Word Error Rate in the transcriptions of under 10%. We have made great progress: our word error rate on LibriSpeech's test-clean set is 6.5%, which not only achieves our initial goal, but gets us close to human-level performance. This post is an overview of the team's efforts and ends with a more detailed explanation of the final piece of the puzzle: the CTC decoder.

The architecture

Deep Speech is an end-to-end trainable, character-level, deep recurrent neural network (RNN). In less buzzwordy terms: it's a deep neural network with recurrent layers that takes audio features as input and outputs characters directly — the transcription of the audio. It can be trained using supervised learning from scratch, without any external "sources of intelligence" like a grapheme-to-phoneme converter or forced alignment on the input. In practice, instead of processing slices of the audio input individually, we process all slices at once.

The network has five layers: the input is fed into three fully connected layers, followed by a bidirectional RNN layer, and finally another fully connected layer. The hidden fully connected layers use the ReLU activation; the RNN layer uses LSTM cells with tanh activation. The output of the network is a matrix of character probabilities over time. In other words, for each time step the network outputs one probability for each character in the alphabet, which represents the likelihood of that character corresponding to what's being said in the audio at that time. The CTC loss function considers all alignments of the audio to the transcription at the same time, allowing us to maximize the probability of the correct transcription being predicted without worrying about alignment. Finally, we train using the Adam optimizer.
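To make the five-layer topology described above concrete, here is a minimal sketch of the forward network in tf.keras. The feature count, layer width, and alphabet size are illustrative assumptions rather than the project's actual hyperparameters, and the sketch stops at the per-character probabilities; the real training pipeline applies the CTC loss, a clipped activation, and other details not shown here.

```python
# Minimal sketch of the five-layer topology described above. Sizes are
# assumptions for illustration, not the project's real hyperparameters.
import tensorflow as tf

N_FEATURES = 26    # audio features per time step (assumed)
N_HIDDEN = 1024    # width of the hidden layers (assumed)
N_CHARACTERS = 29  # a-z, space, apostrophe, CTC blank (assumed)

def build_deep_speech_like_model():
    inputs = tf.keras.Input(shape=(None, N_FEATURES))  # (time, features)
    x = inputs
    # Three fully connected layers with ReLU, applied per time step.
    for _ in range(3):
        x = tf.keras.layers.Dense(N_HIDDEN, activation="relu")(x)
    # One bidirectional recurrent layer with LSTM cells (tanh activation).
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(N_HIDDEN, return_sequences=True))(x)
    # Final fully connected layer producing per-character probabilities for
    # every time step; CTC loss is applied to this output during training.
    outputs = tf.keras.layers.Dense(N_CHARACTERS, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_deep_speech_like_model()
model.summary()
```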
The data

Supervised learning requires data, lots and lots of it. Training a model like Deep Speech requires thousands of hours of labeled audio, and obtaining and preparing this data can be as much work, if not more, as implementing the network and the training logic. We started by downloading freely available speech corpora like TED-LIUM and LibriSpeech, as well as acquiring paid corpora like Fisher and Switchboard. We wrote importers in Python for the different data sets that convert the audio files to WAV, split the audio, and clean unneeded characters like punctuation and accents out of the transcriptions. Finally, we stored the preprocessed data in CSV files that can be used to feed data into the network.

Using existing speech corpora allowed us to quickly start working on the model. But in order to achieve excellent results, we needed a lot more data. We had to be creative. We thought that maybe this type of speech data would already exist out there, sitting in people's archives, so we reached out to public TV and radio stations, language study departments in universities, and basically anyone who might have labeled speech data to share.

Through this effort, we were able to more than double the amount of training data we had to work with, which is now enough for training a high-quality English model. Having a high-quality voice corpus publicly available not only helps advance our own speech recognition engine; it will eventually allow for broad innovation, because developers, startups and researchers around the world can train and experiment with different architectures and models for different languages. It could help democratize access to Deep Learning for those who can't afford to pay for thousands of hours of training data (almost everyone).

To build a speech corpus that's free, open source, and big enough to create meaningful products with, we worked with Mozilla's Open Innovation team and launched the Common Voice project to collect and validate speech contributions from volunteers all over the world. Today, the team is releasing a large collection of voice data into the public domain. Find out more about the release on the Open Innovation Medium blog.

The hardware

Deep Speech has over 120 million parameters, and training a model this large is a very computationally expensive task: you need lots of GPUs if you don't want to wait forever for results. We looked into training on the cloud, but it doesn't work financially: dedicated hardware pays for itself quite quickly if you do a lot of training. The cloud is a good way to do fast hyperparameter explorations though, so keep that in mind.

We started with a single machine running four Titan X Pascal GPUs, and then bought another two servers with 8 Titan XPs each. We run the two 8-GPU machines as a cluster, and the older 4-GPU machine is left independent to run smaller experiments and test code changes that require more compute power than our development machines have. This setup is fairly efficient, and for our larger training runs we can go from zero to a good model in about a week.

Setting up distributed training with TensorFlow was an arduous process. Although it has the most mature distributed training tools of the available deep learning frameworks, getting things to actually work without bugs and to take full advantage of the extra compute power is tricky. Our current setup works thanks to the incredible efforts of my colleague Tilman Kamp, who endured long battles with TensorFlow, Slurm, and even the Linux kernel until we had everything working.

Putting it all together

At this point, we have two papers to guide us, a model implemented based on those papers, the resulting data, and the hardware required for the training process. It turns out that replicating the results of a paper isn't that straightforward. The vast majority of papers don't specify all the hyperparameters they use, if they specify any at all. This means you have to spend a whole lot of time and energy doing hyperparameter searches to find a good set of values. Our initial tests with values chosen through a mix of randomness and intuition weren't even close to the ones reported by the paper, probably due to small differences in the architecture — for one, we used LSTM (long short-term memory) cells instead of GRU (gated recurrent unit) cells. We spent a lot of time doing a binary search on dropout ratios, we reduced the learning rate, changed the way the weights were initialized, and experimented with the size of the hidden layers as well. All of those changes got us pretty close to our desired target of <10% Word Error Rate, but not quite there.
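Since every result in this post is reported as Word Error Rate, a quick aside on the metric may help: WER is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal, self-contained sketch:

```python
# Word error rate: word-level Levenshtein distance divided by the number
# of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat has tiny paws", "the cat has tiny pause"))  # 0.2
```

So a single wrong word in a five-word reference already costs 20% WER, which is why the last few points of improvement described below took so much work.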
One piece missing from our code was an important optimization: integrating our language model into the decoder. The CTC (Connectionist Temporal Classification) decoder works by taking the probability matrix that is output by the model and walking over it, looking for the most likely text sequence according to that matrix. If at time step 0 the letter "C" is the most likely, at time step 1 the letter "A" is the most likely, and at time step 2 the letter "T" is the most likely, then the transcription given by the simplest possible decoder will be "CAT". This strategy is called greedy decoding (a short code sketch appears below).

This is a pretty good way of decoding the probabilities output by the model into a sequence of characters, but it has one major flaw: it only takes into account the output of the network, which means it only takes into account the information from the audio. When the same audio has two equally likely transcriptions (think "new" vs. "knew", "pause" vs. "paws"), the model can only guess at which one is correct. This is far from optimal: if the first four words in a sentence are "the cat has tiny", we can be pretty sure that the fifth word will be "paws" rather than "pause". Answering those types of questions is the job of a language model, and if we could integrate a language model into the decoding phase of our model, we could get much better results.

When we first tried to tackle this issue, we ran into a couple of blockers in TensorFlow: first, it doesn't expose its beam scoring functionality in the Python API (probably for performance reasons); and second, the log probabilities output by the CTC loss function were (are?) invalid. We decided to work around the problem by building something like a spell checker instead: go through the transcription and see if there are any small modifications we can make that increase the likelihood of that transcription being valid English, according to the language model. This did a pretty good job of correcting small spelling mistakes in the output, but as we got closer and closer to our target error rate, we realized that it wasn't going to be enough. We'd have to bite the bullet and write some C++.

Beam scoring with a language model

Integrating the language model into the decoder involves querying the language model every time we evaluate an addition to the transcription. Going back to the previous example, when deciding whether we want to choose "paws" or "pause" for the next word after "the cat has tiny", we query the language model and use that score as a weight to sort the candidate transcriptions. Now we get to use information not just from the audio but also from our language model to decide which transcription is more likely. The algorithm is described in this paper by Hannun et al.

Luckily, TensorFlow does have an extension point on its CTC beam search decoder that allows the user to supply their own beam scorer. This means all you have to do is write a beam scorer that queries the language model and plug that in. For our case, we wanted that functionality to be exposed to our Python code, so we also exposed it as a custom TensorFlow operation that can be loaded using tf.load_op_library. Getting all of this to work with our setup required quite a bit of effort, from fighting with the Bazel build system for hours, to making sure all the code was able to handle Unicode input in a consistent way, to debugging the beam scorer itself.
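Here is the greedy-decoding sketch promised above: take the most likely symbol at each time step, collapse repeats, and drop CTC blanks. The alphabet and the blank-index convention are assumptions for illustration; it shows exactly the audio-only strategy that the language-model beam scorer improves upon.

```python
# Minimal sketch of greedy CTC decoding over a (time, characters) probability
# matrix. The alphabet and the "blank is the last index" convention are
# assumptions for illustration; real decoders are more involved.
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz' ")  # assumed alphabet
BLANK = len(ALPHABET)                            # assumed CTC blank index

def greedy_decode(probs: np.ndarray) -> str:
    """probs has shape (time_steps, len(ALPHABET) + 1)."""
    best = probs.argmax(axis=1)           # most likely symbol at each time step
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:  # collapse repeats, drop blanks
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)

# Toy example: three time steps that put most of their mass on "c", "a", "t".
toy = np.full((3, len(ALPHABET) + 1), 0.01)
toy[0, ALPHABET.index("c")] = 0.9
toy[1, ALPHABET.index("a")] = 0.9
toy[2, ALPHABET.index("t")] = 0.9
print(greedy_decode(toy))  # -> "cat"
```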
The system requires quite a few pieces to work together:

- The language model itself (we use KenLM for building and querying).
- A trie of all the words in our vocabulary.
- An alphabet file that maps the integer labels output by the network into characters.

Although adding this many moving parts does make our code harder to modify and apply to different use cases (like other languages), it brings great benefits:

- Our word error rate on LibriSpeech's test-clean set went from 16% to 6.5%, which not only achieves our initial goal, but gets us close to human-level performance (5.83% according to the Deep Speech 2 paper).
- On a MacBook Pro, the model can do inference at a real-time factor of around 0.3x using the GPU, and around 1.4x on the CPU alone. (A real-time factor of 1x means you can transcribe 1 second of audio in 1 second.)

It has been an incredible journey to get to this place: the initial release of our model! In the future we want to release a model that's fast enough to run on a mobile device or a Raspberry Pi. If this type of work sounds interesting or useful to you, come check out our repository on GitHub and our Discourse channel. We have a growing community of contributors and we're excited to help you create and publish a model for your language.
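As a small illustration of the language-model querying described above, here is a hedged sketch using the kenlm Python bindings to score the "paws" vs. "pause" example. The model file name is a placeholder: you would first need to build a model with KenLM's lmplz/build_binary tools, and it need not match the language model the release actually ships.

```python
# Hedged sketch: score candidate transcriptions with a KenLM language model.
# "lm.binary" is a placeholder for a model built with KenLM's own tools;
# scores are log10 probabilities, so higher (less negative) is better.
import kenlm

lm = kenlm.Model("lm.binary")

for candidate in ("the cat has tiny paws", "the cat has tiny pause"):
    print(candidate, lm.score(candidate, bos=True, eos=True))

# A reasonable English model should give "paws" the higher score, which is
# exactly the signal the beam scorer feeds back into the decoder.
```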
Posted over 6 years ago
Jan Keromnes is a Senior Software Engineer working for the Release Management team on tools and automation, and is the lead developer for Janitor. Janitor offers developer environments as a service for Firefox, Servo and other open source projects. It uses Cloud9 IDE (front-end), Docker servers (back-end), and is 100% web-based, so you can jump straight into fresh on-demand environments that are pre-configured and ready for work, without wasting time setting up yet another local checkout or VM. This newsletter was initially published on the Janitor blog. You can contact the Janitor community on their Discourse forum.

Hi there,

This is your recurrent burst of good news about the Janitor. Thank you ever so much for being part of this community. It really means a lot.

1) Announcing Windows Environments

Janitor is great for quickly fixing platform-specific bugs in your projects, especially if you don't normally develop on that platform. Today, we only provide Linux containers (Ubuntu 16.04), but many of you asked for native Windows environments on Janitor, so that's exactly what we plan to give you. We want to make it easy for you to work on all operating systems, without the hassle of setting up a VM or maintaining a dual boot. In fact, you won't even need to install anything other than a good web browser (like Firefox Quantum), because our Windows environments will be accessible from the web, with a graphical VNC environment, just like our current Linux containers. We're looking into Windows VMs on Azure and TaskCluster workers on AWS. If Mozilla plays along, you should see Windows environments for Firefox on Janitor within just a few months. (If you can help us get there faster, please let us know here, here or here.)

2) Announcing Janitor 0.0.9

So much has happened this year that it was hard to find time to write about our progress. This version bump was long overdue.
Here is a quick rundown of what we did since July:

- Now serving Cloud9 directly from Janitor (c9.io account no longer required)
- Made both IDE and VNC load much faster (thanks to better browser caching)
- Improved the Docker proxy to allow working in multiple containers at the same time
- Added the Discourse open source project to Janitor (thanks notriddle)
- Added janitor.json configuration files to automate your project's workflows on Janitor (thanks ntim)
- Added a "Reviews" IDE sidebar with code review comments you need to address (thanks ntim)
- Added two new Docker servers to our cluster (thanks IRILL for the much needed sponsorship upgrade)
- Now pulling automated Docker image builds (thanks to Docker Hub and CircleCI)
- Expanded our API to manage Docker containers (to create / inspect / delete containers and image layers)
- Created a Docker administration page to efficiently manage our container farm
- Cleaner UI and more controls in our "Projects" and "Containers" pages (thanks ntim, Coder206 and fbeaufort)
- Dropped the "The" in "The Janitor" because it's cleaner (thanks arshad)
- Refreshed the Firefox, Servo and Chromium project logos (thanks Coder206, arshad and ntim)
- Switched Firefox (hg) from mozilla-central to mozilla-unified (thanks ntim)
- Upgraded to Git 2.15.0
- Upgraded to Mercurial 4.4.1
- Upgraded to Clang 5.0 and replaced Gold with LLD (links Firefox 2x faster)
- Upgraded to Rust 1.22.1 / 1.23.0-nightly (installed via rustup 1.7.0)
- Upgraded to Node.js 8.9.1 and npm 5.5.1 (now installed via nvm 0.33.6)
- Upgraded to Ninja 1.8.2 (now with bash completion)
- Upgraded to rr 5.0.0
- Upgraded to the latest Vim 8 and Neovim
- Installed the latest valgrind (for nbp)
- Installed the latest tmux (for Paul Rouget)
- … and many more improvements and bug fixes.

3) Our Cluster Just Got Bigger

Janitor is now used by over 400 developers, and our hardware was starting to feel small, so IRILL upgraded their sponsorship, growing our cluster to a total of 6 servers (4 Docker hosts, including 3 at IRILL in Versailles and 1 at Mozilla in California, as well as 2 VPS web app hosts at OVH in Gravelines). This means that Janitor now runs on 42 CPUs, 120 GB RAM and 4 TB of disk space. Here is a picture of EtienneWan and me manually installing the new servers in IRILL's data center near Paris. You can really thank IRILL and Sylvestre for keeping us going! In the future, we'll make it much simpler for anyone to join our cluster, in order to accept many more open source projects and developers onto Janitor.

4) Janitor Around the World

Here are some events we went to, or plan to attend:

- Watch how Coder206 presented Janitor to Sudbury's Google Developer Group, with a cool side-by-side comparison hacking on Servo.
- Come see two Janitor lightning talks at Mozilla's All Hands in Austin this December, in the Firefox Lightning Talks and Power tools for open source tracks.
- Come hack on open source software with Janitor at INSA Lyon or 42 in Paris in just a few months (two hackathons to be announced).

5) Last Stretch to Beta

2017 has been such a wild ride. We significantly lowered the barrier to new contributions for several major open source projects, allowing many people to contribute for the first time to Firefox, Chromium, Servo, Thunderbird (and more), and we proved that it was possible to modernize software development at scale. Now we just need to finish a few more things before we can call our Alpha a resounding success.
In 2018, Janitor Beta will get us to the next level, with Windows environments (and maybe MacOS too); massive Docker scaling improvements; an open build farm that anyone can join; new open source partnerships; and more radical automation to make software development faster and more fun. More on that very soon.

And that's a wrap for today. How is everything going? We'd love to know! Also, our Discourse and IRC channel are great resources to ask questions and learn more about this project.

Stay safe,
Team Janitor

P.S. One more thing: Here is a sneak peek at the beautiful new design that ntim, arshad and notriddle are working on for Janitor.
Posted over 6 years ago by Daniel Stenberg
The never-ending series of curl releases continued today when we released version 7.57.0. The 171st release since the beginning, and the release that follows 37 days after 7.56.1. Remember that 7.56.1 was an extra release that fixed a few of the most annoying regressions. We bump the minor number to 57 and clear the patch number in this release due to the changes introduced. None of them very ground-breaking, but fun and useful and detailed below.

41 contributors helped fix 69 bugs in these 37 days since the previous release, using 115 separate commits. 23 of those contributors were new, making the total list of contributors now contain 1649 individuals! 25 individuals authored commits since the previous release, making the total number of authors 540 persons. The curl web site currently sends out 8GB of data per hour to over 2 million HTTP requests per day.

Support RFC 7616 – HTTP Digest

This allows HTTP Digest authentication to use the much better SHA256 algorithm instead of the old, and deemed unsuitable, MD5. This should be a transparent improvement, so curl should just be able to use this without any particular new option having to be set, but the server-side support for this version seems to still be a bit lacking. (Side-note: I'm credited in RFC 7616 for having contributed my thoughts!)

Sharing the connection cache

In this modern age with multi-core processors and applications using multi-threaded designs, we of course want libcurl to enable applications to get the best performance out of libcurl. libcurl is already thread-safe, so you can run parallel transfers multi-threaded perfectly fine if you want to, but it doesn't allow the application to share handles between threads. Before this specific change, that limitation forced multi-threaded applications to be satisfied with letting libcurl have a separate "connection cache" in each thread.

The connection cache, sometimes also referred to as the connection pool, is where libcurl keeps live connections that were previously used for a transfer and still haven't been closed, so that a subsequent request might be able to re-use one of them. Getting a re-used connection for a request is much faster than having to create a new one. Having one connection cache per thread is ineffective.

Starting now, libcurl's "share concept" allows an application to specify a single connection cache to be used cross-thread and cross-handle, so that connection re-use will be much improved when libcurl is used multi-threaded. This will significantly benefit the most demanding libcurl applications, but it will also allow more flexible designs, as now the connection pool can be designed to survive individual handles in a way that wasn't previously possible.
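The post describes the share concept without code, so here is a hedged sketch of what using it from Python via pycurl could look like. The native interface is libcurl's curl_share API; the pycurl constant name LOCK_DATA_CONNECT for the new connection-cache sharing is an assumption about sufficiently new bindings, which is why the sketch looks it up defensively.

```python
# Hedged sketch of libcurl's "share" concept via pycurl. Connection-cache
# sharing needs libcurl >= 7.57.0; the LOCK_DATA_CONNECT constant name is
# an assumption about newer pycurl builds, so it is looked up defensively.
from io import BytesIO
import pycurl

share = pycurl.CurlShare()
share.setopt(pycurl.SH_SHARE, pycurl.LOCK_DATA_DNS)   # DNS cache sharing (long available)
connect = getattr(pycurl, "LOCK_DATA_CONNECT", None)  # assumed name in newer bindings
if connect is not None:
    share.setopt(pycurl.SH_SHARE, connect)            # share the connection cache too

def fetch(url: str) -> bytes:
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.SHARE, share)  # handles using this share re-use pooled connections
    c.perform()
    c.close()
    return buf.getvalue()

# Two requests to the same host: with connection-cache sharing enabled, the
# second one can re-use the connection opened by the first, even though it
# runs on a brand new easy handle.
fetch("https://example.com/")
fetch("https://example.com/robots.txt")
```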
Brotli compression

The popular browsers have supported the brotli compression method for a while and it has already become widely supported by servers. Now curl supports it too: the command line tool's --compressed option will ask for brotli as well as gzip, if your build supports it. Similarly, libcurl supports it with its CURLOPT_ACCEPT_ENCODING option. The server can then opt to respond using either compression format, depending on what it knows. According to CertSimple, who ran tests on the top-1000 sites of the Internet, brotli gets contents 14-21% smaller than gzip. (A small pycurl example appears at the end of this post.)

As with other compression algorithms, libcurl uses a 3rd party library for brotli compression, and you may find that Linux distributions and others are a bit behind in shipping packages for a brotli decompression library. Please join in and help this happen. At the moment of this writing, the Debian package is only available in experimental. (Readers may remember my libbrotli project, but that effort isn't really needed anymore since the brotli project itself builds a library these days.)

Three security issues

In spite of our hard work and best efforts, security issues keep getting reported and we fix them accordingly. This release has three new ones and I'll describe them below. None of them are alarmingly serious and they will probably not hurt anyone badly. Two things can be said about the security issues this time:

1. You'll note that we've changed the naming convention for the advisory URLs, so that they now have a random component. This is to reduce potential information leaks based on the name when we pass these around before releases.

2. Two of the flaws happen only on 32 bit systems, which reveals a weakness in our testing. Most of our CI tests, torture tests and fuzzing are made on 64 bit architectures. We have no immediate and good fix for this, but this is something we must work harder on.

1. NTLM buffer overflow via integer overflow (CVE-2017-8816)

Limited to 32 bit systems, this is a flaw where curl takes the combined length of the user name and password, doubles it, and allocates a memory area that big. If that doubling ends up larger than 4GB, an integer overflow makes a very small buffer be allocated instead, and curl will then overwrite that. Yes, having a user name plus password longer than two gigabytes is rather excessive, and I hope very few applications would allow this.

2. FTP wildcard out of bounds read (CVE-2017-8817)

curl's wildcard functionality for FTP transfers is not a very widely used feature, but it was discovered that the default pattern matching function could erroneously read beyond the URL buffer if the match pattern ends with an open bracket '['! This problem was detected by the OSS-Fuzz project. This flaw has existed in the code since this feature was added, over seven years ago.

3. SSL out of buffer access (CVE-2017-8818)

In July this year we introduced multissl support in libcurl. This allows an application to select which TLS backend libcurl should use, if it was built to support more than one. It was a fairly large overhaul to the TLS code in curl and unfortunately it also brought this bug. Also happening only on 32 bit systems, libcurl would allocate a buffer that was 4 bytes too small for the TLS backend's data, which would lead to the TLS library accessing and using data outside of the heap allocated buffer.

Next?

The next release will ship no later than January 24th 2018. I think that one will also add changes and warrant the minor number to bump. We have fun pending stuff such as a new SSH backend, a modifiable happy eyeballs timeout, and more. Get involved and help us do even more good!
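As a footnote to the brotli section above, here is a hedged pycurl sketch of asking libcurl to negotiate compressed responses. The ACCEPT_ENCODING constant is looked up defensively because older pycurl releases exposed it only under the legacy name ENCODING; an empty string tells libcurl to offer every encoding the build supports, which includes brotli when libcurl was built with it.

```python
# Hedged sketch: request compressed responses (brotli included, if the
# libcurl build supports it) and let libcurl decompress them transparently.
from io import BytesIO
import pycurl

ACCEPT_ENCODING = getattr(pycurl, "ACCEPT_ENCODING", pycurl.ENCODING)  # legacy alias fallback

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://example.com/")
c.setopt(ACCEPT_ENCODING, "")  # "" = offer all built-in encodings (gzip, brotli, ...)
c.setopt(pycurl.WRITEDATA, buf)
c.perform()
print(len(buf.getvalue()), "bytes after transparent decompression")
c.close()
```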