ABSTRACT
Effectively onboarding newcomers is essential for the success of open source projects. These projects often provide onboarding guidelines in their 'CONTRIBUTING' files (e.g., CONTRIBUTING.md on GitHub). These files explain, for example, how to find open tasks, implement solutions, and submit code for review. However, they often do not follow a standard structure, can be too large, and omit barriers that newcomers commonly face. In this paper, we propose an automated approach to parse these CONTRIBUTING files and assess how they address onboarding barriers. We manually classified a sample of files according to a model of onboarding barriers from the literature, trained a machine learning classifier that automatically predicts the categories of each paragraph (precision: 0.655, recall: 0.662), and surveyed developers to investigate their perspective on the predictions' adequacy (75% of the predictions were considered adequate). We found that CONTRIBUTING files typically do not cover the barriers newcomers face (52% of the analyzed projects missed at least 3 of the 6 barriers faced by newcomers; 84% missed at least 2). Our analysis also revealed that information about choosing a task and talking with the community, two of the most recurrent barriers newcomers face, is neglected in more than 75% of the projects. We made our classifier available as an online service that analyzes the content of a given CONTRIBUTING file. Our approach may help community builders identify missing information in the project ecosystem they maintain and help newcomers understand what to expect in CONTRIBUTING files.
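To illustrate how coverage figures of this kind can be derived from per-paragraph predictions, the following is a minimal sketch, not the authors' implementation. It assumes each file is represented as a list of label sets, one per paragraph. The function names are hypothetical, and so are the label strings: only "choose a task" and "talk with the community" are named in the abstract, and the remaining four labels are placeholders.

```python
from typing import Dict, List, Set

# Hypothetical label names. Only the first two barriers are named in the
# abstract; the other four are placeholders for the remaining categories.
BARRIER_CATEGORIES: Set[str] = {
    "choose_task",
    "talk_to_community",
    "placeholder_3",
    "placeholder_4",
    "placeholder_5",
    "placeholder_6",
}


def missing_barriers(paragraph_labels: List[Set[str]]) -> Set[str]:
    """Return the barrier categories that no paragraph in a file covers."""
    covered: Set[str] = set().union(*paragraph_labels)
    return BARRIER_CATEGORIES - covered


def share_missing_at_least(files: Dict[str, List[Set[str]]], k: int) -> float:
    """Return the fraction of files that miss at least k barrier categories."""
    if not files:
        return 0.0
    hits = sum(
        1 for labels in files.values() if len(missing_barriers(labels)) >= k
    )
    return hits / len(files)


# Toy data: predicted label sets for each paragraph of two invented files.
example_files: Dict[str, List[Set[str]]] = {
    "project_a": [{"choose_task"}, {"placeholder_3", "placeholder_4"}],
    "project_b": [
        {"talk_to_community"},
        {"choose_task"},
        {"placeholder_3"},
        {"placeholder_4"},
        {"placeholder_5"},
    ],
}

print(share_missing_at_least(example_files, 3))
```

Representing each paragraph as a set of labels allows a paragraph to address more than one barrier, which matches the abstract's wording that the classifier predicts the categories of each paragraph.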
Index Terms
- Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?