# Deep Funding

The goal of [Deep Funding](https://deepfunding.org/) is to develop a system that can allocate resources to public goods with a level of accuracy, fairness, and open access that rivals how private goods are funded by markets, ensuring that high-quality open-source projects can be sustained. Traditional price signals don't exist, so we need "artificial markets" that can simulate the information aggregation properties of real markets while being resistant to the unique failure modes of public goods funding. [Deep Funding is an Impact Evaluator](https://hackmd.io/@dwddao/HypnqpQKke).

In Deep Funding, multiple mechanisms (involving data, mechanism design, and open source) work together. Each layer can be optimized and iterated independently.

1. A mechanism that generates an up-to-date and comprehensive DAG of relevant dependencies given a source node
2. A mechanism that fills the graph with relevant weights. These weights represent the latent item utilities. There can be many ways of getting to them!
   - Aggregating human preferences (polls, pairwise comparisons, ...)
   - Using prediction markets
   - Getting weights from an AI model
   - Collaborative filtering
   - Having experts fill in weights manually
3. A mechanism that takes that weight vector as input and distributes money to the projects

Deep Funding can be viewed as a [Software 2.0](https://karpathy.medium.com/software-2-0-a64152b37c35) approach to public-goods allocation. Instead of manually designing funding rules, evaluation processes, and governance structures, define an objective function, tests, eval sets, and scoring criteria. Then let any kind of mechanism (AI models, prediction markets, statistical algorithms, human raters, etc.) compete to solve them. The human work shifts from hand-crafting decision procedures to specifying what "good allocation" looks like and collecting high-quality data.
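As a toy illustration of layers 2 and 3 (a sketch under assumed data, not Deep Funding's actual implementation): fit latent utilities from a handful of hypothetical pairwise comparisons with a simple Bradley-Terry model, then normalize them into funding shares.

```python
import math

def fit_bradley_terry(n_projects, comparisons, iters=2000, lr=0.05):
    """Fit latent log-utilities from pairwise outcomes via gradient ascent.

    comparisons: list of (winner, loser) index pairs from jurors.
    Returns one score per project (higher = judged more impactful).
    """
    theta = [0.0] * n_projects
    for _ in range(iters):
        grad = [0.0] * n_projects
        for w, l in comparisons:
            # P(w beats l) under Bradley-Terry
            p = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

def to_funding_weights(theta):
    """Layer 3: turn utilities into a budget split (softmax is one of many choices)."""
    exps = [math.exp(t) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical juror data over 3 projects: project 0 wins most comparisons.
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1), (0, 2)]
weights = to_funding_weights(fit_bradley_terry(3, comparisons))
```

Everything around the loss, the normalizer, and the data source can be swapped out independently, which is exactly the layering argument above.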
Everything else becomes an optimization problem where participants try to produce weight predictions that best fit the data. Deep Funding can be seen as **an evolving benchmark suite for truthfully estimating public-goods value**, and progress comes from iterating on the evals rather than hard-coding the system itself.

## Desired Properties

- Credible Neutrality Through Transparent and Simple Mechanisms
- Comparative Truth-Seeking Over Absolute Metrics
- Plurality-Aware Preference Aggregation
- Collusion-Resistant Architecture
- Practical Scalability
- Credible Exit and Forkability
- Works on Public Existing Infrastructure
- Decentralized and Market-Like Mechanisms to Incentivize Useful Curation
  - Dependencies reveal themselves through market mechanisms rather than being declared
  - Skin in the Game: participants have something to lose from bad assessments
- Project Independence (no need to participate in the process to get funded)

## Current Approach

So far, Deep Funding has been implemented like this:

1. A list of projects is chosen. This is usually provided by an external entity or process (e.g: the [best model from the ML competition](https://cryptopond.xyz/modelfactory/detail/2564617) chooses the next 100 projects). So far, a DAG/graph structure has not been needed since all projects have been compared for their impact on the "Ethereum Ecosystem".
   - In its current shape, the graph's vertices are projects and the edges are the relative impact of each project on its parent. The same approach could be used for [anything that matches the graph](https://x.com/VitalikButerin/status/1981946493780345303) shape (e.g: science research).
2. Jurors do pairwise comparisons between projects. An aggregation method is chosen (Huber loss, L2 norm in log space, ...) to derive the "ground truth" relative project weights.
3. An ML competition and [a Prediction Market](https://ethresear.ch/t/deep-funding-a-prediction-market-for-open-source-dependencies/23101) are kicked off. Modelers and traders are evaluated against a holdout set of pairwise comparisons.
4. Participants are rewarded based on how close they get to the jurors' "ground truth".

### Open Problems

After participating in the ML competition and Prediction Market, and doing a few deep dives into the data and methodology, I think these are the main open problems.

- **Juror Reliability**
  - So far, expert jurors' pairwise comparisons have been inconsistent, noisy, and low in statistical power
  - Getting comparisons has been quite expensive in time and resources
  - The diversity of the (secret) jury pool is not guaranteed
  - Asking jurors "how much better" introduces order-dependence and scale mismatch
  - Messy jurors have a [disproportionate impact on the weights](https://davidgasquez.github.io/deepfunding-trial-data-analysis/#-robustness-checks)
  - Weights are not consistent due to the limited amount of data collected and the variance in it
  - Large graphs (hundreds of projects) make getting accurate weights from pairwise evaluation infeasible
    - E.g. the GG24 round has ~100 projects and [would need more than 3000 "actively sampled" comparisons to get to a relative error of 10%](https://arxiv.org/pdf/1505.01462)
    - This approach/paradigm requires more training examples than jurors can produce in a reasonable span of time
- **Mechanism Settings**
  - Some parameters have a large effect and haven't been adjusted
    - The aggregation formula (Huber, log loss, Bradley-Terry, ...) has a very large impact on both modelers/traders and project rewards
    - We need more process around who chooses the aggregation formula and why
    - In the pilot (Huber loss), some projects got weights on a scale jurors didn't feel was reasonable (e.g: the EIPs repo got 30%)
  - The prediction market might cause good modelers not to participate, as time of entry is more important than having a good model
    - There might be an incentive to game the market at the last minute
    - It might be worth gaming to increase your project's share, given the money distribution
- **Weights Evaluation**
  - [How do we measure success?](https://davidgasquez.com/weight-allocation-mechanism-evals/) If the goal of pattern recognition was to classify objects in a scene, it made sense to score an algorithm by how often it succeeded in doing so. What is the equivalent for Deep Funding? What is the [metric we are optimizing](https://mlhp.stanford.edu/src/chap4.html#sec-metric-elicitation)?
  - Once the weights are set, there isn't [a process to evaluate how "fit" they are](https://davidgasquez.com/weight-allocation-mechanism-evals/)
    - E.g: the current idea is to gather a connected graph of pairwise comparisons; why not use that to reward projects directly and skip the Prediction Market?
  - We need falsifiable hypotheses to validate that Deep Funding is "better"
- **Graph Maintenance**
  - If the process takes a few weeks, the weights might change significantly (e.g: a project releases a major version)
  - Jurors are also affected by temporal drift, and their preferences evolve over time

## Ideas

- An alternative is to take an approach [inspired by RLHF](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf): **use only a few significant data points to choose and reward the final models** instead of deriving weights for the entire set of children/dependencies of a project. Resolve the market with only a few, well-tested pairs!
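A minimal sketch of how that resolution could work (the models, weights, and trusted pairs below are all hypothetical): score each candidate weight vector by how often it agrees with a small set of high-confidence juror pairs, and pick the winner.

```python
def pairwise_agreement(weights, trusted_pairs):
    """Fraction of high-confidence (winner, loser) juror pairs a weight
    vector reproduces, i.e. where weights[winner] > weights[loser]."""
    hits = sum(1 for w, l in trusted_pairs if weights[w] > weights[l])
    return hits / len(trusted_pairs)

# Hypothetical: two competing models' weight vectors over 4 projects,
# and 3 well-tested juror pairs given as (winner index, loser index).
trusted_pairs = [(0, 3), (1, 2), (0, 2)]
model_a = [0.40, 0.30, 0.20, 0.10]  # agrees with all three pairs
model_b = [0.10, 0.20, 0.40, 0.30]  # agrees with none
best = max([model_a, model_b], key=lambda w: pairwise_agreement(w, trusted_pairs))
```

The point is that a handful of carefully vetted pairs is enough to discriminate between models without ever deriving a full "ground truth" weight vector.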
- Fix the weight distribution (Zipf's law) and make modelers/jurors focus on predicting the rank. Pick the model that aligns most with the pairwise data collected.
  - Win rates can be derived from pairwise comparisons
- Lean on the [[Pairwise Comparisons]] playbook (binary questions over intensity, active sampling, filtering noisy raters) for any human labeling.
- Instead of one canonical graph, allow different stakeholder groups (developers, funders, users) to maintain their own weight overlays on the same edge structure. Aggregate these views using quadratic or other mechanisms.
  - If there is a plurality of these "dependency graphs" (or just different sets of weights), the funding organization can choose which one to use! The curators gain a % of the money for their service. This creates a market-like mechanism that incentivizes useful curation.
- Let dependents set their weight percentage if they're around.
- Let applicants apply at whatever "abstraction level" they want (e.g: a whole framework, one repository, an entire organization). Rely on pairwise comparisons to resolve conflicts.
- Have hypercerts or similar. Their price (total value) sets the weights across dependencies (if `numpy`'s certificates trade at 3x the price of a utility library's, the edge weight reflects this).
- If there are reviewers/validators/jurors, they need to be public so they have some sort of reputation.
  - Reputation system for jurors
    - E.g: whose score is closer to the final one. This biases towards averages.
  - Use graph algorithms ([MaxFlow](https://maxflow.one/how-it-works)) to weight jurors. This trust layer makes all human inputs (juror ratings, edge proposals, curation) sybil-resistant and accountable. People don't get "one vote per account": they get influence proportional to how much the network trusts them, based on who publicly vouches for whom.
Vouches are binary and simple, but they're recursively weighted by the voucher's own trust and penalized if someone vouches for too many others, which makes spam and fake networks costly.
  - Account for jurors' biases with hierarchical Bradley-Terry or similar
  - Allow anyone to be a juror, and select jurors based on their skills
- Stake-based flow:
  - Anyone can propose a new edge, and anyone can stake money on it. If the edge gets funding, the stakers get rewarded. Could also be quadratic-voting style, where you vouch for something.
  - Should the edge weights/stake decay over time unless refreshed by new attestations?
  - Quadratic funding or similar for the stake weighting, to avoid plutocracy
  - Anyone can challenge an edge by staking against it
  - Human attestations from project maintainers or a committee
- Doing [something similar to Ecosyste.ms](https://blog.ecosyste.ms/2025/04/04/ecosystem-funds-ga.html) might be a better way
  - A curated set of repositories. You fund that dependency graph + weights.
  - Could be done looking at the funding or license (if there is a license requiring you to declare your deps).
- Run the mechanism in "eras"/batches so the graph changes and the weights evolve.
- How to expand to a graph of dependencies that are not only code?
  - Academic papers and research that influenced design decisions
  - Cross-language inspiration (e.g., Ruby on Rails influencing web frameworks in other languages)
  - Standards and specifications that enable interoperability
- Allow projects to "insure" against their critical dependencies disappearing or becoming unmaintained. This creates a market signal for dependency risk and could fund maintenance of critical-but-boring infrastructure.
- Composable evaluation criteria
  - Rather than a single weighting mechanism, allow different communities to define their own evaluation functions (security-focused, innovation-focused, stability-focused) that can be composed.
This enables plurality while maintaining comparability.
- Create a bounty system where anyone can claim rewards for discovering hidden dependencies (similar to bug bounties).
  - This crowdsources the graph discovery problem and incentivizes thorough documentation.
- Projects can opt out of the default distribution and declare a custom one for their dependencies. Organizers can honor or ignore it.
- Self-declaration needs a "contest process" to resolve issues/abuse.
- A Harberger tax on self-declarations? Bayesian Truth Serum for weight elicitation?
  - Projects continuously auction off "maintenance contracts" where funders bid on keeping projects maintained. The auction mechanism reveals willingness-to-pay for continued operation. Dependencies naturally emerge as projects that lose maintenance see their dependents bid up their contracts.
- [Explore Rank Centrality](https://arxiv.org/pdf/1209.1688). Theoretical and empirical results show that with a graph that has a decent spectral gap, `O(n log(n))` pair samples suffice for accurate scores and ranking.
- Report which mechanism is closest (by a distance metric) to each juror.
- Reward maintainers of the actual projects for making pairwise choices.
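The Rank Centrality idea mentioned above can be sketched in a few lines (the comparison counts here are toy data, and the implementation is a simplified reading of the paper, not a reference one): build a Markov chain whose transition probabilities follow empirical pairwise win rates, and take its stationary distribution as the project scores.

```python
import numpy as np

def rank_centrality(wins):
    """Rank Centrality (Negahban et al.): wins[i, j] = number of times
    project j beat project i. Returns one score per project (sums to 1)."""
    n = wins.shape[0]
    total = wins + wins.T  # total comparisons per pair
    P = np.zeros((n, n))
    d_max = n  # safe normalizer, keeps each row a probability distribution
    for i in range(n):
        for j in range(n):
            if i != j and total[i, j] > 0:
                # move from i to j with probability proportional to j's win rate over i
                P[i, j] = wins[i, j] / total[i, j] / d_max
        P[i, i] = 1.0 - P[i].sum()  # self-loop absorbs the remainder
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return pi / pi.sum()

# Toy comparison counts among 3 projects; wins[i, j] means j beat i.
# Project 0 wins 4 of 5 comparisons against both others.
wins = np.array([[0, 1, 1],
                 [4, 0, 2],
                 [4, 3, 0]])
scores = rank_centrality(wins)
```

One appealing property for Deep Funding: it consumes exactly the pairwise comparison data jurors already produce, and the spectral-gap sample bound is what the `O(n log(n))` claim refers to.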