Automated Question Detection in GitHub Issue Trackers Using Machine Learning

Sunday 23 February 2025


Noise in software issue trackers has plagued developers for years, and GitHub, a popular platform for open-source development, is no exception. The sheer volume of issues reported daily makes it hard for maintainers to separate legitimate bug reports from noise such as misplaced questions. A new study tackles this problem by developing an automated system that detects and labels questions submitted to GitHub issue trackers.


The researchers started with the RapidRelease dataset, which contains over 2 million issues from popular open-source projects on GitHub. By pre-processing the data, they removed noise such as log lines, stack traces, and other irrelevant information. This left them with a dataset of approximately 102,000 issues that were then used to train machine learning models.
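To make the cleaning step concrete, here is a minimal sketch of how log lines and stack traces might be stripped from an issue body. The regular expressions and the `clean_issue` helper are illustrative assumptions, not the study's actual rules, which this article does not describe.

```python
import re

# Hypothetical patterns: the study's real cleaning rules are not given here.
# A "log line" is assumed to start with a timestamp or a bracketed log level.
LOG_LINE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"
    r"|\[(DEBUG|INFO|WARN|WARNING|ERROR|TRACE)\])"
)
# A "stack frame" is assumed to look like a Java or Python traceback line.
STACK_FRAME = re.compile(
    r'^\s*(at\s+[\w.$<>]+\(.*\)'
    r'|File ".*", line \d+.*'
    r'|Traceback \(most recent call last\):)'
)


def clean_issue(text: str) -> str:
    """Drop lines that look like logs or stack frames; join the rest."""
    kept = []
    for line in text.splitlines():
        if LOG_LINE.match(line) or STACK_FRAME.match(line):
            continue  # treated as noise under the assumed patterns
        if line.strip():
            kept.append(line.strip())
    return " ".join(kept)


raw = (
    "How do I configure the proxy?\n"
    "2025-02-23 10:15:00 INFO starting\n"
    "[ERROR] connection failed\n"
    "    at com.example.Foo.bar(Foo.java:42)\n"
    "Is there a setting for this?"
)
print(clean_issue(raw))
```

A real pipeline would likely need broader patterns (for example, code lines inside Python tracebacks), so treat this as a starting point only.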


The team employed two sentence embedding techniques, Sentence-BERT and Universal Sentence Encoder, to convert the text-based issues into numerical feature vectors. These vectors were then fed into five popular classification algorithms: k-Nearest Neighbors, Decision Tree (C4.5), Random Forest, Support Vector Machine, and Logistic Regression.
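The idea of classifying issues from their embedding vectors can be sketched with the simplest of the five algorithms, k-Nearest Neighbors. The sketch below is not the study's implementation: the three-dimensional vectors are toy stand-ins for real sentence embeddings (which have hundreds of dimensions), the label `"other"` is a hypothetical name for non-question issues, and cosine similarity is an assumed choice of distance.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy vectors standing in for sentence embeddings (assumption, not real data).
train_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
    [0.2, 0.7, 0.1],
]
train_labels = ["question", "question", "question", "other", "other", "other"]

print(knn_predict(train_vectors, train_labels, [0.85, 0.15, 0.05]))
```

In practice the vectors would come from an embedding model such as Sentence-BERT or Universal Sentence Encoder, and the classifier from an established machine learning library rather than hand-written code.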


The results are promising. The best-performing model, a Logistic Regression classifier trained on Universal Sentence Encoder embeddings, achieved an accuracy of 81.68%. In other words, the system assigned the correct label to nearly 82% of the issues it was evaluated on, across all categories and not just questions.


But how well does the system handle the question category itself? The study notes that the hardest part is labeling actual questions, and all algorithms struggled with this task. Even so, the Logistic Regression model achieved a true positive rate of 74.8% for the question category, meaning it correctly flagged roughly three out of four genuine questions.
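Accuracy and true positive rate measure different things, and a small worked example may help keep them apart. The confusion-matrix counts below are invented purely for illustration and are not taken from the study; the helper names and the `"other"` label are likewise assumptions.

```python
def accuracy(confusion):
    """Share of all issues whose predicted label matches the actual label."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[label][label] for label in confusion)
    return correct / total


def true_positive_rate(confusion, label):
    """Share of issues that really are `label` and were predicted as such."""
    row = confusion[label]
    return row[label] / sum(row.values())


def precision(confusion, label):
    """Share of issues predicted as `label` that really are `label`."""
    predicted = sum(confusion[actual][label] for actual in confusion)
    return confusion[label][label] / predicted


# Hypothetical counts, indexed as confusion[actual][predicted].
confusion = {
    "question": {"question": 30, "other": 10},
    "other": {"question": 5, "other": 55},
}

print(accuracy(confusion))                        # over all 100 issues
print(true_positive_rate(confusion, "question"))  # over the 40 real questions
print(precision(confusion, "question"))           # over the 35 predicted questions
```

The true positive rate says how many real questions were caught, while precision is the figure that reflects false positives, that is, non-questions wrongly labeled as questions.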


The implications are significant. By automating the detection of questions in GitHub issue trackers, developers can focus on addressing real bugs rather than wasting time on noise. This could lead to faster resolution times and improved overall development efficiency.


One potential limitation is the reliance on labels provided by developers. The study notes that these labels may not always accurately reflect the type of issue reported. Future work could involve incorporating additional features or data sources to improve the accuracy of the system.


The approach taken in this study offers a promising way to reduce noise, such as misplaced questions, in software issue trackers. By leveraging machine learning and sentence embedding techniques, developers can create more efficient workflows and improve the overall quality of their projects.


Cite this article: “Automated Question Detection in GitHub Issue Trackers Using Machine Learning”, The Science Archive, 2025.


GitHub, Issue Tracker, Spam Detection, Machine Learning, Sentence Embedding, Open-Source, Software Development, Automation, Efficiency, Classification Algorithms


Reference: Aidin Rasti, “Labeling questions inside issue trackers” (2024).

