Publications#

Extreme Value Statistics for General Heterogeneous Data through the Average Tail#

Yi He, John H.J. Einmahl

Journal of the American Statistical Association (Theory and Methods). Forthcoming. AI Percentile: 98%

In the general setting of independent data with possibly very different distributions, extreme value estimators inevitably target the tail of the average distribution function. We consider all possible cases, that is, the extreme value index of the average distribution can be negative, zero, or positive, and we present novel asymptotic theory for the moment estimator. Our results require a different and much more challenging proof than those for the power-law case and are based on a uniform central limit theorem for the underlying weighted tail empirical process. We find that, due to the heterogeneity of the data, the asymptotic variance of the moment estimator can be much smaller than that in the i.i.d. case. We also unravel the improved performance of high quantile and endpoint estimators in this setup. In case of a heavy tail, we ameliorate the Hill estimator by taking an optimal combination of the Hill and the moment estimator. Simulations show the good finite-sample behavior of our limit results. Finally we present applications to the maximum lifespan of monozygotic twins, the ultimate 200m running world records, and to the tail heaviness of energies of earthquakes around the globe.
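
To make the central object concrete, here is a minimal Python sketch of the moment estimator in its classical textbook form, which covers negative, zero, and positive extreme value indices. The function names (`moment_estimator`, `heavy_tailed_sample`, `bounded_sample`), the simulated data, and the choice of `k` are assumptions made for illustration; none of them is taken from the paper.

```python
import math
import random


def moment_estimator(sample, k):
    """Moment estimator of the extreme value index based on the k largest values.

    Assumes the k + 1 largest observations are strictly positive and not all equal.
    """
    x = sorted(sample)
    n = len(x)
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < len(sample)")
    log_threshold = math.log(x[n - k - 1])  # log of the (k + 1)-th largest value
    log_excesses = [math.log(v) - log_threshold for v in x[n - k:]]
    m1 = sum(log_excesses) / k              # first empirical moment (the Hill estimator)
    m2 = sum(e * e for e in log_excesses) / k
    return m1 + 1.0 - 0.5 / (1.0 - m1 * m1 / m2)


def heavy_tailed_sample(n, rng):
    """Independent Pareto draws (tail index 2, so index 0.5) with observation-specific scales."""
    return [(1.0 + (i % 5)) * rng.paretovariate(2.0) for i in range(n)]


def bounded_sample(n, rng):
    """Independent uniform draws on [1, 2): finite right endpoint, index -1."""
    return [1.0 + rng.random() for _ in range(n)]


rng = random.Random(0)
print("heavy tail, target  0.5:", moment_estimator(heavy_tailed_sample(20000, rng), 500))
print("short tail, target -1.0:", moment_estimator(bounded_sample(20000, rng), 500))
```

A positive estimate points to a heavy tail and a negative one to a finite endpoint. The first sample mixes five different scales, so it is independent but not identically distributed.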

Download article (free)

Accurate Estimates of Ultimate 100-Meter Records#

John H.J. Einmahl, Yi He

Extremes. Forthcoming.

We employ the novel theory of heterogeneous extreme value statistics to accurately estimate the ultimate world records for the 100-m running race, for men and for women. For this aim we collected data from 1991 through 2023 from thousands of top athletes, using multiple fast times per athlete. We consider the left endpoint of the probability distribution of the running times of a top athlete and define the ultimate world record as the minimum, over all top athletes, of all these endpoints. For men we estimate the ultimate world record to be 9.56 seconds. More prudently, employing this heterogeneous extreme value theory we construct an accurate asymptotic 95% lower confidence bound on the ultimate world record of 9.49 seconds, still quite close to the present world record of 9.58. For the women’s 100-meter dash our point estimate of the ultimate world record is 10.34 seconds, somewhat lower than the world record of 10.49. The more prudent 95% lower confidence bound on the women’s ultimate world record is 10.20.
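
The definition in the abstract can be written compactly. The notation below is an assumption introduced here for illustration ($F_i$ for the distribution function of the running times of top athlete $i$, and $m$ for the number of top athletes); the paper may use different symbols.

$$
x_i^{*} = \inf\{x : F_i(x) > 0\}, \qquad x^{*} = \min_{1 \le i \le m} x_i^{*}.
$$

Here $x_i^{*}$ is the left endpoint for athlete $i$ and $x^{*}$ is the ultimate world record. With the figures quoted above, the men's point estimate lies $9.58 - 9.56 = 0.02$ seconds below the current record and the lower confidence bound lies $9.58 - 9.49 = 0.09$ seconds below it. For women the corresponding gaps are $10.49 - 10.34 = 0.15$ and $10.49 - 10.20 = 0.29$ seconds.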

Read article

Media coverage (partial list):

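Eastern Institute of Technology (东方理工) WeChat official account, University of Amsterdam website, Tilburg University website, Rtl, MSN, Parool, de Volkskrant, Trouw, EOS Wetenschap, BD, AD, WK, Runnersworld, EngineersOnline, Newsbomb, Vietnam.VN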

Ridge prediction under dense factor augmented models#

Yi He

Journal of the American Statistical Association (Theory and Methods). 119 (546), 1566–1578, 2024. AI Percentile: 98%

This paper establishes a comprehensive theory of the optimality, robustness, and cross-validation selection consistency for the ridge regression under factor-augmented models with possibly dense idiosyncratic information. Using spectral analysis for random matrices, we show that the ridge regression is asymptotically efficient in capturing both factor and idiosyncratic information by minimizing the limiting predictive loss among the entire class of spectral regularized estimators under large-dimensional factor models and mixed-effects hypothesis. We derive an asymptotically optimal ridge penalty in closed form and prove that a bias-corrected k-fold cross-validation procedure can adaptively select the best ridge penalty in large samples. We extend the theory to the autoregressive models with many exogenous variables and establish a consistent cross-validation procedure using what we call the double ridge regression method. Our results allow for non-parametric distributions for, possibly heavy-tailed, martingale difference errors and idiosyncratic random coefficients and adapt to the cross-sectional and temporal dependence structures of the large-dimensional predictors. We demonstrate the performance of our ridge estimators in simulated examples as well as an economic dataset. All the proofs are available in the supplement, which also includes more technical discussions and remarks, extra simulation results, and useful lemmas that may be of independent interest.
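
For orientation, the textbook forms of the two objects named in the abstract are sketched below. The notation is an assumption made here: $X$ is the $n \times p$ predictor matrix with rows $x_i^\top$, $y$ is the response vector, $\lambda > 0$ is the ridge penalty, and $\mathcal{I}_1, \dots, \mathcal{I}_k$ are the folds. The scaling of the penalty may differ from the paper's, and neither the paper's bias correction nor its double ridge method is reproduced.

$$
\hat\beta_\lambda = (X^\top X + \lambda I_p)^{-1} X^\top y, \qquad
\mathrm{CV}(\lambda) = \frac{1}{n} \sum_{j=1}^{k} \sum_{i \in \mathcal{I}_j} \bigl(y_i - x_i^\top \hat\beta_\lambda^{(-j)}\bigr)^2,
$$

where $\hat\beta_\lambda^{(-j)}$ is the ridge estimator computed without the observations in fold $\mathcal{I}_j$. Plain k-fold cross-validation selects the $\lambda$ that minimizes $\mathrm{CV}(\lambda)$.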

Download article (free) | MATLAB code

Extreme value inference for heterogeneous power law data#

John H.J. Einmahl, Yi He

The Annals of Statistics 51 (3), 1331 - 1356, 2023. AI Percentile: 98%

We extend extreme value statistics to independent data with possibly very different distributions. In particular, we present novel asymptotic normality results for the Hill estimator, which now estimates the extreme value index of the average distribution. Due to the heterogeneity, the asymptotic variance can be substantially smaller than that in the i.i.d. case. As a special case, we consider a heterogeneous scales model where the asymptotic variance can be calculated explicitly. The primary tool for the proofs is the functional central limit theorem for a weighted tail empirical process. A simulation study shows the good finite-sample behavior of our limit theorems. We also present applications to assess the tail heaviness of earthquake energies and of cross-sectional stock market losses.
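
As a rough illustration of the setting, the following Python sketch applies the Hill estimator to independent Pareto draws with different scales. The simulation design (deterministic scales between 0.5 and 2, tail index 2, `k = 400`) and the function names are assumptions made here for illustration, not the paper's heterogeneous scales model or its code.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimator: mean log-excess of the k largest values over the (k + 1)-th largest."""
    x = sorted(sample)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    log_threshold = math.log(x[n - k - 1])
    return sum(math.log(v) - log_threshold for v in x[n - k:]) / k


def heterogeneous_scales_sample(n, alpha, rng):
    """X_i = c_i * Z_i with Z_i standard Pareto(alpha) and deterministic scales c_i in [0.5, 2]."""
    return [(0.5 + 1.5 * i / (n - 1)) * rng.paretovariate(alpha) for i in range(n)]


rng = random.Random(42)
data = heterogeneous_scales_sample(5000, alpha=2.0, rng=rng)
print("Hill estimate:", hill_estimator(data, 400), "target:", 1 / 2.0)
```

Every observation has its own distribution, yet the pooled sample still has a Pareto-type tail, so the estimate should land near the extreme value index 1/alpha of the average distribution.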

Download article (free)

Extreme value estimation for heterogeneous data#

John H.J. Einmahl and Yi He

Journal of Business & Economic Statistics, 41:1, 255-269, 2023. AI Percentile: 98%

We develop a universal econometric formulation of empirical power laws possibly driven by parameter heterogeneity. Our approach extends classical extreme value theory to specifying the tail behavior of the empirical distribution of a general data set with possibly heterogeneous marginal distributions. We discuss several model examples that satisfy our conditions and demonstrate in simulations how heterogeneity may generate empirical power laws. We observe a cross-sectional power law for US stock losses and show that this tail behavior is largely driven by the heterogeneous volatilities of the individual assets.
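
The mechanism can be illustrated with a small simulation: Gaussian losses have a thin tail when every asset shares the same volatility, but pooling assets whose volatilities vary widely produces a power-law-like cross-sectional tail. The Pareto(3) volatility distribution, the crude two-quantile slope diagnostic, and all names below are assumptions chosen for illustration. They are not the paper's model or estimator.

```python
import math
import random


def tail_slope(sample, p_low=0.02, p_high=0.002):
    """Log-log slope of the empirical upper tail between two high quantiles.

    For an exact power law P(X > x) = C * x**(-alpha) this returns roughly alpha.
    """
    x = sorted(sample)
    n = len(x)
    q_low = x[n - 1 - round(n * p_low)]    # about n * p_low observations exceed this value
    q_high = x[n - 1 - round(n * p_high)]  # about n * p_high observations exceed this value
    return math.log(p_low / p_high) / math.log(q_high / q_low)


def losses(n, rng, heterogeneous):
    """Gaussian losses; volatilities are Pareto(3) across assets if heterogeneous, else 1."""
    out = []
    for _ in range(n):
        vol = rng.paretovariate(3.0) if heterogeneous else 1.0
        out.append(vol * rng.gauss(0.0, 1.0))
    return out


rng = random.Random(2023)
print("common volatility       :", tail_slope(losses(50000, rng, False)))
print("heterogeneous volatility:", tail_slope(losses(50000, rng, True)))
```

With a common volatility the slope is steep and keeps growing further out in the tail. With heterogeneous volatilities it settles near 3, the tail index of the volatility distribution assumed here.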

Download article (free) | MATLAB code

Most powerful test against a sequence of high dimensional local alternatives#

Yi He, Sombut Jaidee and Jiti Gao

Journal of Econometrics, 234:1, 151-177, 2023. AI Percentile: 96%

We develop a powerful quadratic test for the overall significance of many covariates in a dense regression model in the presence of nuisance parameters. By equally weighting the sample moments, the test is asymptotically correct in high dimensions even when the number of coefficients is larger than the sample size. Our theory allows a non-parametric error distribution and weakly exogenous nuisance variables, in particular autoregressors in many applications. Using random matrix theory, we show that the test has the optimal asymptotic testing power among a large class of competitors against local alternatives whose coordinates are dense in the eigenbasis of the high dimensional sample covariance matrix among regressors. The asymptotic results are adaptive to the covariates’ cross-sectional and temporal dependence structure and do not require a limiting spectral law of their sample covariance matrix. In the most general case, the nuisance estimation may play a role in the asymptotic limit and we give a robust modification for these irregular scenarios. Monte Carlo studies suggest a good power performance of our proposed test against high dimensional dense alternatives for various data generating processes. We apply the test to detect the significance of over one hundred exogenous variables in the FRED-MD database for predicting the monthly growth in the US industrial production index.

Download article (free) | MATLAB code

Risk Analysis via Generalized Pareto Distributions#

Yi He, Liang Peng, Dabao Zhang and Zifeng Zhao

Journal of Business & Economic Statistics, 40:2, 852-867, 2022. AI Percentile: 98%

This article is among the most-read JBES articles of all time.

We compute the value-at-risk of financial losses in the tail by fitting a generalized Pareto distribution to exceedances over a high but not divergent threshold. This paper infers such a model for both independent observations and time series data. We show that the asymptotic variance for the maximum likelihood estimation depends on the choice of threshold unlike the existing study of using a divergent threshold. For interval estimation, we propose a random weighted bootstrap method with critical values computed by the empirical distribution of the absolute differences between the bootstrapped estimators and the maximum likelihood estimator. The finite sample performance of the derived confidence intervals is demonstrated through numerical studies before applying to real data in insurance and finance.
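
The first step, turning a fitted generalized Pareto distribution into a value-at-risk, can be sketched with the standard peaks-over-threshold quantile formula. The function name `gpd_var` and its argument names are hypothetical, and the parameter values in the example call are made up. Fitting the parameters by maximum likelihood and the paper's bootstrap intervals are not shown.

```python
import math


def gpd_var(level, threshold, exceed_fraction, xi, sigma):
    """Value-at-risk at `level` from a GPD(xi, sigma) fitted to exceedances over `threshold`.

    `exceed_fraction` is the share of observations above the threshold (N_u / n).
    """
    if not 1.0 - exceed_fraction < level < 1.0:
        raise ValueError("level must lie strictly between 1 - exceed_fraction and 1")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    ratio = (1.0 - level) / exceed_fraction  # conditional tail probability above the threshold
    if abs(xi) < 1e-12:
        return threshold - sigma * math.log(ratio)  # exponential limit as xi -> 0
    return threshold + sigma / xi * (ratio ** (-xi) - 1.0)


# Illustrative (made-up) inputs: 5% of losses exceed the threshold 2.0.
print(gpd_var(0.99, threshold=2.0, exceed_fraction=0.05, xi=0.25, sigma=1.0))
```

The formula inverts the tail approximation P(X > x) = exceed_fraction * (1 + xi * (x - threshold) / sigma) ** (-1 / xi), so it applies only to levels beyond the threshold.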

Download article (free) | R code

Inference for conditional value-at-risk of a predictive regression#

Yi He, Yanxi Hou, Liang Peng and Haipeng Shen

The Annals of Statistics, 48:6, 3442-3464, 2020. AI Percentile: 98%

Conditional value-at-risk is a popular risk measure in risk management. We study the inference problem of conditional value-at-risk under a linear predictive regression model. We derive the asymptotic distribution of the least squares estimator for the conditional value-at-risk. Our results relax the model assumptions made in Chun et al. (2012) and correct their mistake in the asymptotic variance expression. We show that the asymptotic variance depends on the quantile density function of the unobserved error and whether the model has a predictor with infinite variance, which makes it challenging to actually quantify the uncertainty of the conditional risk measure. To make the inference feasible, we then propose a smooth empirical likelihood based method for constructing a confidence interval for the conditional value-at-risk based on either independent errors or GARCH errors. Our approach not only bypasses the challenge of directly estimating the asymptotic variance but also does not need to know whether there exists an infinite variance predictor in the predictive model. Furthermore, we apply the same idea to the quantile regression method, which allows infinite variance predictors and generalizes the parameter estimation in Whang (2006) to conditional value-at-risk in the supplementary material. We demonstrate the finite sample performance of the derived confidence intervals through numerical studies before applying them to real data.

Download article (free)

Statistical inference for a relative risk measure#

Yi He, Yanxi Hou, Liang Peng and Jiliang Sheng

Journal of Business & Economic Statistics, 37:2, 301-311, 2019. AI Percentile: 98%

For monitoring systemic risk from regulators’ point of view, this article proposes a relative risk measure, which is sensitive to the market comovement. The asymptotic normality of a nonparametric estimator and its smoothed version is established when the observations are independent. To effectively construct an interval without complicated asymptotic variance estimation, a jackknife empirical likelihood inference procedure based on the smoothed nonparametric estimation is provided with a Wilks type of result in case of independent observations. When the data follow AR-GARCH models, the relative risk measure with respect to the errors becomes useful and so we propose a corresponding nonparametric estimator. A simulation study and real-life data analysis show that the proposed relative risk measure is useful in monitoring systemic risk.

Download article | R code

Estimation of extreme depth-based quantile regions#

Yi He and John H.J. Einmahl

Journal of the Royal Statistical Society - Series B, 79:449-461, 2017. AI Percentile: 99%

Consider the extreme quantile region induced by the half-space depth function $HD$ of the form $\mathcal{Q}=\{x\in\mathbb{R}^d:HD(x,P)\leq \beta\}$, such that $P\mathcal{Q}=p$ for a given, very small $p>0$. Since this involves extrapolation outside the data cloud, this region can hardly be estimated through a fully non-parametric procedure. Using extreme value theory we construct a natural semiparametric estimator of this quantile region and prove a refined consistency result. A simulation study clearly demonstrates the good performance of our estimator. We use the procedure for risk management by applying it to stock market returns.
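
For readers unfamiliar with half-space depth, the sketch below approximates the empirical (sample) version of $HD$ for planar data by scanning a finite grid of directions. The function name `halfspace_depth_2d` and the direction grid are assumptions made for illustration. This is the fully non-parametric quantity, not the paper's semiparametric estimator of the extreme region.

```python
import math


def halfspace_depth_2d(point, data, n_directions=360):
    """Approximate empirical half-space (Tukey) depth of `point` in a planar data set.

    The exact depth minimises over all directions; here a finite grid of
    directions is scanned, so the result is an upper bound on the exact depth.
    """
    n = len(data)
    depth = 1.0
    for j in range(n_directions):
        angle = 2.0 * math.pi * j / n_directions
        ux, uy = math.cos(angle), math.sin(angle)
        # Fraction of observations in the closed half-plane through `point` with normal (ux, uy).
        count = sum(1 for (x, y) in data
                    if ux * (x - point[0]) + uy * (y - point[1]) >= 0.0)
        depth = min(depth, count / n)
    return depth


square = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
print("centre :", halfspace_depth_2d((0.0, 0.0), square))
print("outside:", halfspace_depth_2d((10.0, 10.0), square))
```

Points outside the data cloud get empirical depth zero no matter how far out they lie. This is why a region with very small probability $p$ cannot be recovered from the empirical depth alone and calls for extrapolation.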

Download article (free) | MATLAB code