The work in this paper centers on automatically exploring apps.
"
Design #1: LLM-enhanced automatic app crawler. To improve app traversal efficiency, we introduce the MobileViews Crawler, which uses fixed interaction rules to handle tedious app operations, with LLMs enhancing its ability to process complex UI states. In this workflow, the MobileViews Crawler first captures the screen's VH, simplifies it to identify interactable UI components, and interacts with them sequentially. When predefined triggers appear, such as complex login screens or screen idleness that the fixed interaction rules cannot handle, an LLM is invoked to understand the current UI state and perform actions that bypass these screens. According to our empirical study on 80+ popular mobile apps from the Google Play Store, the MobileViews Crawler raises the traversal success rate (i.e., not getting blocked on a screen) from 18.6% to 44.2% by using LLMs. Human annotators are supplementary, handling only sensitive and essential operations such as account registration and CAPTCHA verification.
"
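To make the quoted workflow concrete, here is a minimal sketch of the capture-and-simplify step, assuming an adb-accessible Android instance and uiautomator's XML dump format; the function names are mine, not the paper's.

```python
import subprocess
import xml.etree.ElementTree as ET

def dump_view_hierarchy(device: str) -> ET.Element:
    """Dump the current screen's view hierarchy (VH) via uiautomator."""
    subprocess.run(
        ["adb", "-s", device, "shell", "uiautomator", "dump", "/sdcard/vh.xml"],
        check=True)
    xml = subprocess.run(
        ["adb", "-s", device, "shell", "cat", "/sdcard/vh.xml"],
        capture_output=True, text=True, check=True).stdout
    return ET.fromstring(xml)

def simplify_vh(root: ET.Element) -> list[dict]:
    """Keep only interactable nodes; the crawler interacts with these in order."""
    interactable = []
    for node in root.iter("node"):
        if node.get("clickable") == "true" or node.get("scrollable") == "true":
            interactable.append({
                "class": node.get("class"),
                "text": node.get("text") or "",
                "bounds": node.get("bounds"),  # e.g. "[0,96][1080,208]"
            })
    return interactable
```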
The paper collects data from 20K apps!? It even argues that the few hundred apps in AITW are too narrow a sample.
The main contributions are as follows:

- We design and implement an automatic, highly parallel mobile app interaction approach to collect high-quality, representative mobile screen datasets with minimal human intervention.
- Using this approach, we collect the largest mobile screen dataset to date, covering a diverse range of modern mobile apps and comprehensive screen information. It includes more than 600K screenshot-VH pairs, which is 9× larger than the previous largest dataset. The dataset is open-sourced at https://huggingface.co/datasets/mllmTeam/MobileViews.
- We evaluate the collected large-scale dataset on multimodal LLMs that power mobile screen assistants. Compared to the previous largest dataset, it demonstrates significant improvements in enhancing screen assistant capabilities.
App metadata collection. We first collect metadata for over 20K Android apps (e.g., app name, package name) from the Google Play Store. These apps span 33 categories, including many of the most popular apps currently available. We expect that the mobile screens obtained from these modern apps will better represent current mobile screen designs, especially compared to the previous largest dataset, which was collected nearly seven years ago (Deka et al., 2017).
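As a rough idea of what this metadata step could look like, here is a sketch using the google-play-scraper PyPI package; the paper does not say which tooling it used, and the package names below are illustrative, not from the dataset.

```python
# Sketch only: the paper's actual scraping pipeline is unspecified.
from google_play_scraper import app as fetch_app

PACKAGES = ["com.spotify.music", "com.duolingo"]  # hypothetical sample

def collect_metadata(packages: list[str]) -> list[dict]:
    """Fetch per-app metadata (name, package, category) from the Play Store."""
    records = []
    for pkg in packages:
        meta = fetch_app(pkg, lang="en", country="us")
        records.append({
            "app_name": meta["title"],
            "package_name": meta["appId"],
            "category": meta["genre"],  # one of the 33 Play Store categories
        })
    return records
```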
App interaction coordination. Next, the app metadata is sent to the App Interaction Coordinator, which dispatches app traversal tasks to the LLM-enhanced automatic app traversal tool. The App Interaction Coordinator monitors available Android instances in the SoC clusters and connects them to app traversal processes. When an app traversal is completed or unintentionally terminated, it logs the system and app states for future reference.
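A sketch of how such a coordinator might pair traversal tasks with available Android instances, assuming adb-visible devices; `coordinate` and its queue-based dispatch are my guess at the structure, not the paper's code.

```python
import queue
import subprocess
import threading
from typing import Callable

def list_devices() -> list[str]:
    """Poll adb for Android instances currently attached to the cluster."""
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout
    return [line.split("\t")[0] for line in out.splitlines()[1:]
            if line.endswith("\tdevice")]

def coordinate(apps: "queue.Queue[str]",
               traverse: Callable[[str, str], None]) -> None:
    """Dispatch one traversal task at a time to each idle device."""
    def worker(device: str) -> None:
        while True:
            pkg = apps.get()
            try:
                traverse(device, pkg)
            except Exception as exc:
                # Log system/app state when a traversal terminates unexpectedly.
                print(f"[{device}] {pkg} terminated: {exc!r}")
            finally:
                apps.task_done()

    for device in list_devices():
        threading.Thread(target=worker, args=(device,), daemon=True).start()
    apps.join()  # block until every dispatched app has been traversed
```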
LLM-enhanced automatic app traversal. The app traversal process starts by installing the corresponding app from the Google Play Store. It then launches the app and traverses its screens to collect data. The traversal process prioritizes rule-based methods, sequentially exploring all interactable UI components. When the current UI state cannot be processed, LLMs are invoked to handle complex UIs, such as login screens. Human intervention is required only when the app interaction coordinator detects an anomaly, such as a screen remaining idle for an extended period.
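Putting the pieces together, here is a hedged sketch of that traversal loop: rules first, LLM fallback on hard states, backtracking otherwise. `install_and_launch`, `save_pair`, `visited`, `is_hard_state`, and `llm_take_over` are hypothetical stand-ins; `dump_view_hierarchy`/`simplify_vh` come from the earlier sketch, and the 1,000-action cap is taken from the notes below.

```python
import re
import subprocess

MAX_ACTIONS = 1000  # per the notes: exploration of each app stops at 1,000 actions

def tap(device: str, bounds: str) -> None:
    """Tap the center of a node's bounds string, e.g. '[0,96][1080,208]'."""
    x1, y1, x2, y2 = map(int, re.findall(r"\d+", bounds))
    subprocess.run(["adb", "-s", device, "shell", "input", "tap",
                    str((x1 + x2) // 2), str((y1 + y2) // 2)], check=True)

def traverse_app(device: str, pkg: str) -> None:
    """Rule-based sequential exploration with an LLM fallback for hard states."""
    install_and_launch(device, pkg)           # hypothetical helper
    for step in range(MAX_ACTIONS):
        vh = dump_view_hierarchy(device)      # from the earlier sketch
        save_pair(pkg, step, device, vh)      # persist the screenshot-VH pair
        nxt = next((c for c in simplify_vh(vh) if not visited(pkg, c)), None)
        if nxt is not None:
            tap(device, nxt["bounds"])        # fixed interaction rule
        elif is_hard_state(vh):               # predefined trigger, e.g. login
            llm_take_over(device, vh)         # LLM picks the bypass action
        else:
            subprocess.run(["adb", "-s", device, "shell", "input",
                            "keyevent", "KEYCODE_BACK"], check=True)  # backtrack
```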
The paper repeatedly stresses that its dataset includes UI structure information (the VH in the figure above), which the well-known AITW dataset lacks.
Exploration of each app ends after 1,000 actions. A useful reference point.
DroidBot is too easily defeated by complex UIs, login screens for instance, and this is where the LLM is brought in. So how does the system decide it has hit a hard case?
(1) when the app traversal tool encounters a UI state that matches a predefined trigger, and (2) when the system stays idle in a single UI state for an extended period.
In other words, the hand-off can be configured in advance, or detected at runtime when the machine gets stuck.
For example, one rule could be: when a login button shows up among the UI elements, let the LLM take over. For idleness, the paper writes: "In the case of handling UI state idleness, we inform the LLM that the goal is to traverse the entire app to capture more UI states, allowing it to take over the app traversal logic and perform actions on the current UI." Both trigger types are sketched below.
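A minimal sketch of the two hand-off triggers, assuming the VH format from the earlier sketches; the keyword list, hash-based idleness check, and 60-second timeout are my assumptions, and the prompt merely paraphrases the quoted goal.

```python
import hashlib
import time
import xml.etree.ElementTree as ET

LOGIN_KEYWORDS = {"log in", "sign in", "login"}  # illustrative trigger list

def is_hard_state(vh: ET.Element) -> bool:
    """Trigger (1): a predefined pattern, e.g. a login button in the VH."""
    texts = [(n.get("text") or "").lower() for n in vh.iter("node")]
    return any(kw in t for t in texts for kw in LOGIN_KEYWORDS)

class IdleDetector:
    """Trigger (2): the same VH hash persisting beyond a time budget."""
    def __init__(self, timeout_s: float = 60.0):  # timeout is an assumption
        self.timeout_s = timeout_s
        self.last_hash = None
        self.since = time.monotonic()

    def is_idle(self, vh_xml: str) -> bool:
        h = hashlib.md5(vh_xml.encode()).hexdigest()
        if h != self.last_hash:                     # screen changed: reset clock
            self.last_hash, self.since = h, time.monotonic()
        return time.monotonic() - self.since > self.timeout_s

# On idleness, the prompt restates the traversal goal, echoing the paper:
IDLE_PROMPT = ("The goal is to traverse the entire app to capture more UI "
               "states. Given the simplified view hierarchy below, choose "
               "the next action.\n{simplified_vh}")
```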
And what if even the LLM cannot solve the problem (e.g., CAPTCHAs or phone-number verification)? Then a human steps in. From the paper: "When the app traversal tool encounters an unprocessable UI state, it records the current UI state and pauses the traversal. Human annotators then handle these apps in a batch manner: they manually open the apps, solve the previous stuck UI states, and then resume the automatic traversal, which significantly reduces their effort."
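A sketch of what this record-pause-resume flow could look like on disk; the directory layout and helper names are hypothetical, and `traverse_app` refers to the earlier traversal sketch.

```python
import json
import pathlib

STUCK_DIR = pathlib.Path("stuck_states")  # hypothetical on-disk layout

def pause_for_human(device: str, pkg: str, vh_xml: str) -> None:
    """Record the unprocessable UI state and park the app for batch review."""
    STUCK_DIR.mkdir(exist_ok=True)
    (STUCK_DIR / f"{pkg}.json").write_text(
        json.dumps({"device": device, "vh": vh_xml}))

def resume_batch() -> None:
    """After annotators clear the stuck screens by hand, restart traversal
    for every parked app in one pass."""
    for record_file in STUCK_DIR.glob("*.json"):
        record = json.loads(record_file.read_text())
        traverse_app(record["device"], record_file.stem)  # earlier sketch
        record_file.unlink()
```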
A distinguishing trait of the proposed dataset is its low image redundancy: each app contributes relatively few screenshots, but a very large number of distinct apps are covered.
The project even set up a dedicated account for every app! That takes serious resources.
The paper then argues from many more angles that their dataset is better. Will read that later.