LessWrong, September 26, 03:37
The Dataset-Assembly Challenge in AI Abstraction Learning

 

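This post examines a key challenge in learning abstractions in AI research: the "dataset-assembly" problem. The author points out that abstraction-learning and handling shifting structures (truesight) depend on each other in a loop, so they have to be learned jointly. Dataset-assembly involves deciding which variables to consider, how to sample them (for example, following specific particles rather than points in space), and which functions to use to represent them. It is related to inferring shifting structures when the structure is already known (truesight) and to heuristics for selecting subsets of variables; the shared goal is to find functions that sample from a stable structure. The author likens it to qualitative research and philosophical reasoning in humans, stressing the role of research taste and of recognizing meaningful groupings. The author's thinking on dataset-assembly is still at an early stage, but the overall framework for jointly solving these three classes of problems has been outlined.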

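🧠 **Circular dependency between abstraction-learning and shifting-structure learning**: The post notes that to learn abstractions (abstraction-learning), an AI first needs to solve the problem of handling shifting and resampled structures (truesight), and vice versa. This mutual dependence means the two classes of problems must be learned jointly to break the loop.

🧩 **The core problems of dataset-assembly**: Dataset-assembly is the key to resolving the loop above, and it requires answering three specific questions: 1. which variables to consider; 2. how to sample those variables, for example keeping the observed object (such as a particle) fixed during learning rather than the observation point; 3. which functions to use to represent those variables and their interactions, which may require aggregating or discarding information from some sub-variables.

💡 **Connection to existing heuristics**: The heuristics for dataset-assembly overlap with those for handling shifting structures (truesight) and with the earlier-proposed methods for choosing subsets of variables for which to learn synergistic/redundant information. They share the goal of finding functions that sample from a stable structure; what sets dataset-assembly apart is whether that stable structure is already known.

🧑‍🏫 **Analogy to human cognition**: The author likens "dataset-assembly" to qualitative research and philosophical reasoning in humans, and takes it to correspond to research "taste": how to identify the key features of a new environment or domain, and how to meaningfully group and decompose parts of reality into a new abstract hierarchy or field of study.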

Published on September 25, 2025 7:21 PM GMT

This is part of a series covering my current research agenda. Refer to the linked post for additional context.

This is going to be a very short part. As I'd mentioned in the initial post, I've not yet done much work on this subproblem.
 


From Part 1, we more or less know how to learn the abstractions given the set of variables over which they're defined. We know their type signature and the hierarchical structure they assemble into, so we can just cast it as a machine-learning problem (assuming a number of practical issues are solved). For clarity, let's dub this problem "abstraction-learning".

From Part 2, we more or less know how to deal with shifting/resampled structures. While the presence of specific abstractions doesn't uniquely lock down what other abstractions are present at higher/lower/sideways levels, we can infer a probability distribution over what abstractions are likely to be there, and then resample from it until finding one that works. Let's call this "truesight".
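To make the "resample until finding one that works" step concrete, here is a minimal sketch of a resampling loop. Everything in it is an illustrative assumption rather than the author's method: the candidate "structures" are just slopes of a line, `prior_weights` stands in for the inferred distribution over abstractions, and `fits` stands in for whatever test decides that a candidate works.

```python
import random


def resample_until_works(candidates, weights, works, rng, max_tries=1000):
    """Draw candidates from a weighted distribution until one passes `works`.

    Returns (candidate, number_of_draws), or (None, max_tries) if nothing passed.
    """
    for attempt in range(1, max_tries + 1):
        candidate = rng.choices(candidates, weights=weights, k=1)[0]
        if works(candidate):
            return candidate, attempt
    return None, max_tries


# Hypothetical toy setup: the hidden rule is y = 2 * x.
observations = [(1, 2), (2, 4), (3, 6)]
candidate_slopes = [1, 2, 3]
prior_weights = [0.5, 0.3, 0.2]  # assumed stand-in for the inferred distribution


def fits(slope):
    """A candidate 'works' if it reproduces every observation."""
    return all(y == slope * x for x, y in observations)


found, tries = resample_until_works(
    candidate_slopes, prior_weights, fits, random.Random(0)
)
print(found, tries)
```

The loop only ever returns a candidate that passed the check, so a poor prior costs extra draws rather than a wrong answer.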

Except, uh. Part 1 only works given the solution to Part 2's problem, and Part 2 only works given the solution to Part 1's problem. We can't learn abstractions before we've stabilized the structure/attained truesight, but we can't attain truesight until we learn what abstractions we're looking for. We need to, somehow, figure out how to learn them jointly.

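This represents the third class of problems we need to solve: figuring out how to transform whatever data we happen to have into datasets for learning new abstractions. Such datasets would need to be isomorphic to samples from the same fixed (at least at a given high level) structure. Assembling them might require figuring out which variables to consider, how to sample them (e.g., following a specific particle rather than a fixed point in space), and which functions of those variables to use, possibly aggregating or discarding information from some sub-variables.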

Call this "dataset-assembly".
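As a toy picture of why the choice of what to sample matters, consider following a specific particle versus watching a fixed point in space. The sketch below is a hypothetical illustration, not part of the original agenda: the ring world, the particle names, and both helper functions are assumptions made up for the example.

```python
# Two particles on a 1-D ring of 6 cells, each moving at its own constant velocity.
RING = 6
velocities = {"a": 1, "b": 2}
start = {"a": 0, "b": 3}


def positions_at(t):
    """Position of every particle at time step t."""
    return {name: (start[name] + velocities[name] * t) % RING for name in velocities}


frames = [positions_at(t) for t in range(5)]


def per_particle_dataset(frames, name):
    """Follow one particle: each sample is its (position, next position) pair."""
    return [(f0[name], f1[name]) for f0, f1 in zip(frames, frames[1:])]


def per_cell_dataset(frames, cell):
    """Watch one cell: each sample lists the particles occupying it at that time."""
    return [sorted(n for n, p in f.items() if p == cell) for f in frames]


# Following a particle, every sample shows the same step size: one stable rule.
steps_a = {(nxt - cur) % RING for cur, nxt in per_particle_dataset(frames, "a")}
steps_b = {(nxt - cur) % RING for cur, nxt in per_particle_dataset(frames, "b")}
print(steps_a, steps_b)

# Watching a cell, the samples mix different particles and empty moments.
print(per_cell_dataset(frames, 3))
```

In the per-particle view, each row is a sample from one fixed rule (a constant step). The per-cell view holds the same raw data, but its rows mix different particles and empty moments, so they are not samples from a single stable structure.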

Dataset-assembly has some overlap with the truesight problem, so the heuristic machinery for implementing them would be partly shared. In both cases, we're looking for functions over samples of some known variables that effectively sample from the same stable structure. The difference is whether we already know that structure or not.

Another overlap is with the heuristics I'd mentioned in 1.6, the ones for figuring out which subsets of variables to try learning synergistic/redundant-information variables for (instead of doing it for all subsets). Indeed, given the shifting-structures problem, those are actually folded into the heuristics for assembling abstraction-learning datasets!
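One very simple stand-in for such a subset-selection heuristic is to screen pairs of variables by their empirical mutual information and only propose the pairs that clear a threshold, instead of trying every subset. This is an assumption chosen for illustration, not the heuristic from 1.6; the function names, the threshold, and the toy columns are all hypothetical. A pairwise screen of this kind can only flag redundancy, and would miss purely synergistic relationships such as XOR.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def propose_subsets(columns, threshold=0.5):
    """Return variable pairs whose mutual information clears the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(columns), 2)
        if mutual_information(columns[a], columns[b]) >= threshold
    ]


# Hypothetical toy data: y copies x, z is independent of both.
columns = {
    "x": [0, 0, 1, 1],
    "y": [0, 0, 1, 1],
    "z": [0, 1, 0, 1],
}
print(propose_subsets(columns))
```

Only the pairs that survive the screen would then be handed to the more expensive abstraction-learning step.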

Introspectively, in humans, "dataset-assembly" is represented by qualitative research, as well as by philosophical reasoning (or at least my model of what "philosophical reasoning" is). "Dataset-assembly heuristics" correspond to research taste: to figuring out what features of some new environment/domain to pay attention to, and which parts of reality could be meaningfully grouped together and decomposed into a new abstract hierarchy/separate field of study.

My thinking on the topic of dataset-assembly is relatively new, and isn't yet refined into a proper model/distilled into something ready for public consumption. Hence, this post is little more than a stub.

That said, I hope the overall picture of the challenge is now clarified. We need to figure out how to set up a process that jointly learns the heuristics for solving these three classes of problems.


What I'm particularly interested in here for the purposes of the bounties is... well, pretty much anything related, since the map is pretty blank. Three core questions:

(My go-to approach in such cases is to figure out several practical heuristics, go through a few concrete cases, then attempt to distill general principles/algorithms based on those analyses.)


