Generating text inputs for testing mobile applications using GPT-3

A team of researchers from the Chinese Academy of Sciences and Monash University has presented a new approach to generating text inputs for mobile application testing based on a pre-trained large language model (LLM). Dubbed QTypist, the approach was evaluated on 106 Android applications in combination with existing automated testing tools, showing significant improvements in testing performance.
According to the researchers, one of the main barriers to automating mobile app testing is the need to generate text inputs, which can be challenging even for human testers. This is because many different categories of input may be required, including geographic locations, addresses, and health measurements, and because inputs required on successive pages may be related to one another, resulting in validation constraints. Furthermore, as one of the paper's authors explained on Twitter, the input provided in one application view can determine which additional views are displayed.
Large language models (LLMs) such as BERT and GPT-3 have been shown to be able to write essays, answer questions, and generate source code. QTypist attempts to leverage the ability of LLMs to understand the input instructions shown in a mobile application and to generate meaningful output that can be used as text input to the application.
Given a GUI page with text inputs and its associated view hierarchy file, we first extract context information for the text inputs and design linguistic patterns to generate prompts for input into the LLM. To enhance the performance of the LLM in mobile input scenarios, we develop a prompt-based data construction and tuning method that automatically constructs prompts and responses for model tuning.
In the first step, QTypist uses a GUI testing tool to extract the context information of the current GUI view, including metadata associated with the input widgets such as user hints, local context associated with nearby widgets, and global context such as the activity name.
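To make this step concrete, the following is a minimal sketch of context extraction, assuming the GUI testing tool provides a uiautomator-style XML dump of the view hierarchy. The node attributes used and the notion of "nearby" widgets are illustrative assumptions, not the paper's exact implementation.

```python
import xml.etree.ElementTree as ET

def extract_input_context(hierarchy_xml: str, activity_name: str) -> list[dict]:
    """Collect widget, local, and global context for each text-input widget."""
    root = ET.fromstring(hierarchy_xml)
    nodes = list(root.iter("node"))
    contexts = []
    for i, node in enumerate(nodes):
        if "EditText" not in node.get("class", ""):
            continue
        contexts.append({
            # Widget-level metadata: hint text, resource id, content description.
            "widget": {
                "hint": node.get("text", ""),
                "resource_id": node.get("resource-id", ""),
                "content_desc": node.get("content-desc", ""),
            },
            # Local context: text of the immediately preceding widgets,
            # which often act as labels for the input field.
            "local": [n.get("text") for n in nodes[max(0, i - 2):i] if n.get("text")],
            # Global context: the activity hosting the page.
            "global": activity_name,
        })
    return contexts
```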
The prompt-generation step relies on the three categories of extracted information to generate prompts based on linguistic patterns that the authors derived from a set of 500 reference applications.
This process comes up with 14 linguistic patterns related to the input widget, local context, and global context. […] The input widget patterns specifically define what to input into the widget, and we use keywords such as noun (widget[n]), verb (widget[v]), and preposition (widget[prep]) to design the patterns.
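A prompt can then be assembled from the extracted context. The sketch below illustrates the idea with simple templates; these are illustrative stand-ins for the paper's 14 patterns, not the published set.

```python
def build_prompt(ctx: dict) -> str:
    """Turn one extracted input context into a natural-language prompt."""
    widget = ctx["widget"]["hint"] or ctx["widget"]["content_desc"] or "this field"
    parts = [f'On the "{ctx["global"]}" page']
    if ctx["local"]:
        # Local-context pattern: mention the nearest label widget.
        parts.append(f'next to the label "{ctx["local"][-1]}"')
    # Input-widget pattern: ask directly what should be typed into the widget.
    parts.append(f'what text should be entered in the "{widget}" input?')
    return ", ".join(parts) + " Answer with the text only."
```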
The prompt data set is finally fed to GPT-3, whose output is used as the input content. The researchers evaluated the effectiveness of this approach by comparing it against several baseline approaches, including DroidBot, Humanoid, and others, as well as through a human evaluation of the quality of the generated inputs. In addition, they assessed its usefulness by integrating QTypist with automated testing tools and running it on 106 Android apps from Google Play. In all cases, they claim, QTypist was able to improve the performance of existing approaches.
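The final step can be sketched as follows, assuming the legacy OpenAI completions API for GPT-3 and adb for injecting the generated text into the app under test; the model name, decoding parameters, and adb integration are assumptions for illustration, not the paper's exact setup.

```python
import subprocess
import openai  # assumes the pre-1.0 openai SDK with the Completion API

def fill_input(prompt: str) -> None:
    """Ask GPT-3 for an input value and type it into the focused field."""
    completion = openai.Completion.create(
        model="text-davinci-003",  # assumed GPT-3 model name
        prompt=prompt,
        max_tokens=16,
        temperature=0.7,
    )
    text = completion.choices[0].text.strip()
    # adb requires spaces in typed text to be escaped as %s.
    subprocess.run(
        ["adb", "shell", "input", "text", text.replace(" ", "%s")],
        check=True,
    )
```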
While the initial work of the research team behind QTypist shows promise, further work is needed to extend it to cases where the application does not provide enough context information, as well as to scenarios beyond GUI testing.