Research: Estimating Family Income from Administrative Banking Data

A Machine Learning Approach

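The JPMorgan Chase Institute was established to harness the power of administrative banking data to deepen our understanding of critical economic questions and to provide timely insights to decision makers. We recently developed a machine learning-based method for estimating family income, in order to gain deeper insights and improve representativeness in our research. We describe our approach and results in this new release.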

Q&A on the JPMC Institute Income Estimate

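We discuss the motivation for this new income estimate and the potential of applying machine learning methods to administrative banking data.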

What is the JPMorgan Chase Institute Income Estimate (JPMC IIE), and why did the JPMorgan Chase Institute create it?

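Simply put, the Institute's income estimate is an estimate of gross family income for families who regularly use a Chase checking account. Analyzing and understanding families' financial behavior, and how it differs across the income spectrum, is a central theme of the Institute's work. To better assess these dynamics, we needed to create a methodology for estimating gross family income across our data sets.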

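Our ability to extend insights from the Chase portfolio to the U.S. population depends on having, or approximating, a sample that is representative of the broader population, and on being able to distinguish results by key attributes such as age, income, and geography. For example, if we want to measure consumer spending growth in Houston, as we do in our Local Consumer Commerce Index, we want to make sure that the customers we observe in Houston are truly representative of the city, and we might also want to know who within Houston is contributing most of the growth.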

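We know that the Chase portfolio does not perfectly mirror the U.S. population, nor does it offer a perfect window into its customers' incomes. For example, it inherently excludes the unbanked, who tend to have lower incomes. Even for banked families, a financial institution may see payroll income arriving in a customer's account but will not see all of the deductions for taxes, insurance, and retirement made by the employer. And there may be other sources of income that are never deposited into the customer's account.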

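In order to make our samples more representative, we need to be able to reweight our population to match the national income distribution. And in order to study the economic behavior of low-income families, we want to define low income consistently with national benchmarks. We therefore needed an income measure comparable to that of the Census in order to reweight and benchmark our samples. That's why we chose to create JPMC IIE.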
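
To make the reweighting idea concrete, here is a minimal sketch of post-stratification by income quintile. It is an illustration under stated assumptions, not the Institute's actual procedure: the function name `poststratification_weights`, the sample composition, and the choice of quintiles as the only weighting cell are all hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_quintiles, target_shares):
    """Return one weight per family: target share / observed sample share."""
    counts = Counter(sample_quintiles)
    n = len(sample_quintiles)
    return [target_shares[q] / (counts[q] / n) for q in sample_quintiles]


# Hypothetical sample that over-represents higher-income quintiles
# (1 = lowest income quintile, 5 = highest).
sample = [1] * 5 + [2] * 10 + [3] * 20 + [4] * 30 + [5] * 35

# By definition, each national income quintile holds 20% of families.
target = {q: 0.20 for q in range(1, 6)}

weights = poststratification_weights(sample, target)

# After weighting, every quintile should account for about 20% of the sample.
total = sum(weights)
weighted_share = {
    q: sum(w for w, s in zip(weights, sample) if s == q) / total
    for q in target
}
print(weighted_share)
```

Families in under-represented quintiles receive weights above one and families in over-represented quintiles receive weights below one, which is why the weighting variable (here, estimated income) has to be comparable to the national benchmark.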

At a high level, what is the methodology behind JPMC IIE?

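The idea behind JPMC IIE is quite simple, in that it is a classic application of a “supervised learning” problem in machine learning. For some customers we actually know their gross family income, because they applied for a mortgage or credit card with us, and we were required to ask them about their income as part of the underwriting process. These customers represent our “truth set.” Among these customers, we can determine which of the characteristics we observe for all customers are highly predictive of gross family income. In this sense, we can train a model to predict gross family income using features that are observable for everyone. Once we have tuned that model to be as predictive as possible of the ground truth, we can then deploy it to generate a predicted gross family income for everyone else.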
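
The train-on-the-truth-set, predict-for-everyone-else workflow can be sketched as follows. This is a deliberately tiny stand-in, not the model behind JPMC IIE: it assumes a single hypothetical feature (annual account deposits) and a one-variable log-linear regression, and all figures are made up for illustration.

```python
import math


def fit_log_linear(features, incomes):
    """Least-squares fit of log(income) on log(feature).

    Returns (intercept, slope).
    """
    xs = [math.log(x) for x in features]
    ys = [math.log(y) for y in incomes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_income(model, feature):
    """Apply the fitted model to a customer outside the truth set."""
    intercept, slope = model
    return math.exp(intercept + slope * math.log(feature))


# Hypothetical truth set: customers whose gross income is known.
truth_deposits = [20000.0, 35000.0, 50000.0, 80000.0, 120000.0]
truth_income = [30000.0, 50000.0, 72000.0, 110000.0, 170000.0]

model = fit_log_linear(truth_deposits, truth_income)

# Customers with no stated income: only the feature is observed.
other_deposits = [28000.0, 64000.0]
predicted = [predict_income(model, d) for d in other_deposits]
print(predicted)
```

The key constraint carried over from the text is that the model may only use features observable for every customer, since the stated income exists only inside the truth set.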

Just how predictive of family income is JPMC IIE?

Our first version of JPMC IIE draws on a wide variety of features to predict gross family income, including both account information internal to the bank and publicly available data. It is able to come up with a prediction of gross family income that is, on average, within 41 percent of the truth. That is, on average the estimate may be 41 percent higher or lower than the true value. This is referred to as the “mean absolute error.”

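Since we mostly care about ascertaining a family's income quintile, we also evaluate performance by how often the predicted income falls into the same quintile as the family's true income. On this measure, the predicted quintile matches the true quintile 55 percent of the time and is equal or adjacent to the true quintile roughly 90 percent of the time.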
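
The two evaluation measures described above, the average percentage error and the quintile match rates, could be computed along these lines. This is a sketch only: the quintile cutoffs and the incomes are invented for illustration and are not Census thresholds or Institute data.

```python
def mean_abs_pct_error(predicted, actual):
    """Average of |predicted - actual| / actual across families."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


def quintile_of(income, cutoffs):
    """Map an income to quintile 1-5 given four ascending upper bounds."""
    for quintile, upper in enumerate(cutoffs, start=1):
        if income <= upper:
            return quintile
    return 5


def quintile_agreement(predicted, actual, cutoffs):
    """Return (share in the same quintile, share in the same or an adjacent one)."""
    gaps = [
        abs(quintile_of(p, cutoffs) - quintile_of(a, cutoffs))
        for p, a in zip(predicted, actual)
    ]
    n = len(gaps)
    return sum(g == 0 for g in gaps) / n, sum(g <= 1 for g in gaps) / n


# Hypothetical quintile upper bounds (illustrative only).
cutoffs = [25000, 45000, 70000, 110000]

actual = [20000, 40000, 60000, 100000, 150000]
predicted = [30000, 36000, 60000, 50000, 180000]

mape = mean_abs_pct_error(predicted, actual)
exact, within_one = quintile_agreement(predicted, actual, cutoffs)
print(mape, exact, within_one)
```

Reporting the quintile measures alongside the percentage error is useful because a prediction can be far off in dollar terms and still land in the right quintile, which is what matters for reweighting.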

Now, that certainly leaves room for improvement. But let us put those numbers into perspective. Had we simply guessed each family's income based on the average income of families living in the same zip code, according to tax records, we would have been off by 103 percent on average. So that shows the value of leveraging administrative banking data to predict family income.
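
The comparison against a zip-code baseline can be sketched as below. Note the simplifying assumption: the article's baseline uses zip-code averages from tax records, whereas this toy version computes the average from the sample itself. The zip codes, incomes, and model predictions are all hypothetical.

```python
from collections import defaultdict


def mean_abs_pct_error(predicted, actual):
    """Average of |predicted - actual| / actual across families."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical families: (zip code, true income, model prediction).
families = [
    ("77001", 20000.0, 26000.0),
    ("77001", 100000.0, 80000.0),
    ("77002", 40000.0, 50000.0),
    ("77002", 60000.0, 54000.0),
]

# Baseline: guess every family's income as its zip code's average income.
incomes_by_zip = defaultdict(list)
for zip_code, income, _ in families:
    incomes_by_zip[zip_code].append(income)
zip_mean = {z: sum(v) / len(v) for z, v in incomes_by_zip.items()}

actual = [income for _, income, _ in families]
baseline_guess = [zip_mean[z] for z, _, _ in families]
model_guess = [prediction for _, _, prediction in families]

baseline_error = mean_abs_pct_error(baseline_guess, actual)
model_error = mean_abs_pct_error(model_guess, actual)
print(baseline_error, model_error)
```

A zip-code average performs poorly whenever incomes vary widely within a zip code, because low-income families are assigned an income several times their own.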

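We also tested JPMC IIE to see how it performs when we use it to weight the population in our healthcare out-of-pocket spending panel. Sure enough, weighting by both age and JPMC IIE made our population more representative of the general population than weighting by age alone.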

How is this work different from typical JPMorgan Chase Institute research, and what were the key lessons learned from it?

This is one of the first applications of machine learning to our work. In addition, whereas most of our research aims to answer concrete research questions, this publication describes the methodology behind a key data asset for us, JPMC IIE, which is foundational for other research.

We certainly learned quite a bit from this exercise. We'll share one key highlight.

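It was clear to the team from the outset that our predictions would only be as strong as our truth set. We needed to make sure that the truth set was representative of the broad customer base for which we are trying to predict income. By relying on mortgage and credit card applicants for the ground truth, our truth set was biased in favor of higher-income families, so we had to oversample lower-income mortgage and credit card applicants. Stratifying the truth set by income improved the quintile prediction for families in the lowest income quintile by 28 percentage points.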
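
One way to picture stratifying a truth set by income is to draw the same number of applicants from each income quintile, so that lower-income applicants are not swamped by higher-income ones. The sketch below assumes equal allocation per quintile, which the article does not specify; the function `stratified_sample` and the applicant pool are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, per_stratum, seed=0):
    """Draw up to per_stratum records from each income quintile.

    records is a list of (customer_id, quintile) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for record in records:
        strata[record[1]].append(record)
    sample = []
    for quintile in sorted(strata):
        members = strata[quintile]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical applicant pool skewed toward higher-income quintiles.
pool = []
for quintile, count in {1: 20, 2: 40, 3: 80, 4: 120, 5: 140}.items():
    pool.extend((f"q{quintile}-{i}", quintile) for i in range(count))

balanced = stratified_sample(pool, per_stratum=20)
counts = Counter(quintile for _, quintile in balanced)
print(counts)
```

In this toy pool the lowest quintile makes up 5 percent of applicants but 20 percent of the balanced truth set, so a model trained on it sees proportionally more low-income examples.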

So what's next for JPMC IIE? Are there plans to continue enhancing or expanding the income estimate and the scope of these methods?

As we mentioned earlier, with a mean absolute error of 41 percent, there is a lot of room for further improvement. We are taking our initial learnings and continuing to improve the model.

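We are busy refining and adding to our raw features to see whether we can improve predictive accuracy. We are also trying to expand the size of our truth set by finding more customers within the bank for whom we have gross family income.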

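We also see promise in extending the scope of this income estimate beyond checking account customers to credit customers, so that we can produce a uniform predicted estimate for all customers.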

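We hope that publishing our initial methodology will not only show the public the power of using administrative banking data for prediction, but will also generate plenty of feedback for future improvements. So reach out to us with ideas and stay tuned!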