Lost in lossy: compression loses information
Jpeg: image compression artifacts.
First he built a dictionary of 150 keywords in real estate
ads — “creative financing,” for instance — that might
signal a seller’s willingness to play loose. He then looked
for instances in which a house had languished on the
market and yet wound up selling at or even above the
final asking price. In such cases, he found that buyers
typically paid a very small down payment; the smaller
the down payment, in fact, the higher the price they
paid for the house. What could this mean?
Either the most highly leveraged buyers were terrible
bargainers — or, as Ben-David concluded, such anomalies
indicated the artificial inflation that marked a cash-back deal.
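A rough sketch of the screen described above. The field names, the three stand-in keywords, the 180-day cutoff, and the choice to require all three conditions at once are illustrative assumptions, not Ben-David's actual data or thresholds.

```python
# Hypothetical screen: flag listings whose ad uses a "loose financing"
# keyword, that sat on the market a long time, and that still sold at
# or above the final asking price.
KEYWORDS = ("creative financing", "cash back", "motivated seller")  # stand-in list
LONG_LISTING_DAYS = 180  # assumed cutoff for "languished on the market"

def is_suspicious(listing):
    ad = listing["ad_text"].lower()
    has_keyword = any(k in ad for k in KEYWORDS)
    languished = listing["days_on_market"] >= LONG_LISTING_DAYS
    sold_high = listing["sale_price"] >= listing["final_ask"]
    return has_keyword and languished and sold_high

# Made-up listings for illustration only.
listings = [
    {"ad_text": "Creative financing available!", "days_on_market": 240,
     "sale_price": 305000, "final_ask": 300000},
    {"ad_text": "Sunny bungalow near park", "days_on_market": 30,
     "sale_price": 290000, "final_ask": 300000},
]
flagged = [l for l in listings if is_suspicious(l)]
```

The flagged subset is what one would then examine for shared traits such as down payment size or the agents involved.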
Having isolated the suspicious transactions in the data,
Ben-David could now examine the noteworthy traits they
shared. He found that a small group of real estate agents
were repeatedly involved, in particular when the seller was
himself an agent or when there was no second agent in the
deal. Ben-David also found that the suspect transactions
were more likely to occur when the lending bank, rather
than keeping the mortgage, bundled it up with thousands
of others and sold them off as mortgage-backed securities.
This suggests that the issuing banks treat suspect mortgages
with roughly the same care as you might treat a rental car,
knowing that you aren’t responsible for its long-term outcome
once it is out of your possession.
epicureandealmaker on fat tails.
Derivatives: Transferring risk or reducing risk ?
Review Paper. Interest-rate term-structure pricing models: a review
Riccardo Rebonato
A review of interest-rate term-structure modelling, from the early
short-rate-based models to current developments. The focus is on models
used for pricing complex derivatives or for relative-value option trading;
relative-pricing models are therefore given greater emphasis than
equilibrium models.
The review argues that the current state of modelling owes a lot to how
models have historically developed in the industry, and stresses the importance
of 'technological' developments (such as faster computers or more
efficient Monte Carlo techniques) in guiding the direction of theoretical
research.
The importance of the joint practices of vega hedging and daily
model-recalibration is analysed in detail. The relevance of market
incompleteness and of the possible informational inefficiency of
derivatives markets for calibration and pricing is also discussed.
Continue reading "Interest-rate term-structure pricing models: Riccardo Rebonato" »
Susan Athey's work on applied econometrics and heterogeneity of mentorship
wins the Clark Medal, which is awarded to the most accomplished economist
nearing 40 and is the most distinguished prize short of a Nobel.
Bio.
Paul K Greed asks about why the anomaly figure is
less worrisome than it seems:
1. All weather observers are safely far away from Iraq
2. It's still 72 F and sunny in La Jolla
3. America is still red, white and blue.
Prosper now enjoys some powerful tools.
regression analysis forum
adverse selection says avoid the high rate borrowers.
Money Walks journal of patterns in data, performance.
Eric's great survey of lenders: outstandings, return
Prosper Analytics animated charts, not quite chart junk. ROI.
P2P conventional loan analytics.
Prosper's own loan performance database.
Infosthetics shows time trends.
Data visualization in web browser, with interaction.
New champion: IBM's Many Eyes.
Liked by JHeer and radar.oreilly.
Atrios's bookshelf.
A former economist, indeed.
Visualization and segmentation: Gelman's
Bag of tricks for teaching statistics.
See also Gelman's Data Analysis Using Regression and Multilevel/Hierarchical Models.
Hedging beyond duration and convexity.
Considers a representation of the yield curve using a Fourier-like
harmonic series, and offers empirical evidence that such a series supports
a hedging strategy on a mortgage-backed security (MBS) using the first
four principal components of the yield curve.
Haver Analytics provides economic data, ready to use
in Stata and EViews formats.
PCE time series inflationary ?
Underlying model and several of the features of Proc UCM, new in the
Econometrics and Time Series (ETS) module of SAS.
Time series data is generated by marketers as they monitor “sales by month”
and by medical researchers who collect vital sign information over time. This
technique is well suited to modeling the effect of interventions (drug administration
or a change in a marketing plan). This new procedure combines the flexibility of
Proc ARIMA with the ease of use and interpretability of Smoothing models.
UCM does not have the capability to easily model transfer functions, a useful
ARIMA function that is planned for Proc UCM.
An Animated Guide©: Proc UCM (Unobserved Components Model)
Russ Lavery, Contractor for ASG, Inc., PDF
Econometric course notes by John Aldrich.
Seemingly unrelated regressions and simultaneous equations: PDF
Statespace in SAS/ETS.
The STATESPACE procedure analyzes and forecasts multivariate
time series using the state space model. The STATESPACE procedure
is appropriate for jointly forecasting several related time series that
have dynamic interactions. By taking into account the autocorrelations
among the whole set of variables, the STATESPACE procedure may
give better forecasts than methods that model each series separately.
Spreadsheets put on the web by NumSum.
Like Flickr for accountants.
Home value by rooms by Miller Samuel.
This regression is crying out for a log transformation.
And what do all the data points with fractional room counts represent ?

Dashboard Spy: a gallery of management dashboards and consoles full of KPIs
(key performance indicators).
Update 2006 Dec.: Moved to enterprise-dashboard.com.
Ed Tufte adds,
Kimberly 'KC' Claffy measures internet traffic.
Fetch headlines from Google News on a schedule, then rank
headlines by factors:
* appearance day and time,
* prominence on the google news page,
* number of appearances,
* others;
weighted to estimate referer traffic these links bring to their
source.
Listed are the top scoring stories in recent time periods, followed
by a ranking of sources. More detailed reports are linked-to at the
bottom of each table.
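A weighted score of this kind might look roughly like the sketch below. The weights, the field names, and the way each factor is scaled to the 0..1 range are all assumptions for illustration; the original ranking formula is not given.

```python
from datetime import datetime

# Assumed weights for three of the listed factors.
WEIGHTS = {"recency": 0.5, "prominence": 0.3, "appearances": 0.2}

def score(headline, now):
    hours_old = (now - headline["first_seen"]).total_seconds() / 3600.0
    recency = 1.0 / (1.0 + hours_old)              # newer is closer to 1
    prominence = 1.0 / headline["best_position"]   # position 1 = top of the page
    appearances = min(headline["appearances"], 10) / 10.0  # capped at 10 sightings
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["prominence"] * prominence
            + WEIGHTS["appearances"] * appearances)

# Made-up headlines for illustration only.
now = datetime(2006, 1, 1, 12, 0)
headlines = [
    {"title": "B", "first_seen": datetime(2006, 1, 1, 0, 0),
     "best_position": 4, "appearances": 2},
    {"title": "A", "first_seen": datetime(2006, 1, 1, 11, 0),
     "best_position": 1, "appearances": 5},
]
ranked = sorted(headlines, key=lambda h: score(h, now), reverse=True)
```

Summing the scores by source would give the ranking of sources mentioned above.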
Joint regression analysis is used to study genotype-environment interaction:
genotype effects and/or interaction effects within individual
environments are related to environmental effects.
The interaction sum of squares is divided into two parts:
* one part represents the heterogeneity of linear regression
coefficients while
* the second represents the pooled deviations from individual
regression lines.
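The partition above can be sketched for a small genotype-by-environment table of means. This is a minimal illustration that uses the environment mean minus the grand mean as the environmental effect; the data are made up.

```python
def joint_regression(y):
    """y[i][j] = mean response of genotype i in environment j."""
    g, e = len(y), len(y[0])
    grand = sum(map(sum, y)) / (g * e)
    geno_mean = [sum(row) / e for row in y]
    env_mean = [sum(y[i][j] for i in range(g)) / g for j in range(e)]
    env_effect = [m - grand for m in env_mean]   # environmental effects
    sxx = sum(x * x for x in env_effect)

    # Interaction terms: what is left after removing both main effects.
    inter = [[y[i][j] - geno_mean[i] - env_mean[j] + grand for j in range(e)]
             for i in range(g)]
    ss_inter = sum(v * v for row in inter for v in row)

    # Slope of each genotype's interaction on the environmental effect.
    slopes = [sum(inter[i][j] * env_effect[j] for j in range(e)) / sxx
              for i in range(g)]
    ss_het = sxx * sum(b * b for b in slopes)    # heterogeneity of regressions
    ss_dev = ss_inter - ss_het                   # pooled deviations from the lines
    return slopes, ss_inter, ss_het, ss_dev

# Made-up means: 3 genotypes (rows) in 3 environments (columns).
y = [[4.0, 5.0, 7.0],
     [3.0, 5.0, 8.0],
     [5.0, 5.5, 6.0]]
slopes, ss_inter, ss_het, ss_dev = joint_regression(y)
```

The slopes here are deviations from the average response, so they sum to zero across genotypes.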
Length of stay (LOS) is an important measure of hospital activity and
health care utilization, but its empirical distribution is often
positively skewed.
Median regression appears to be a suitable alternative to analyze
the clustered and positively skewed LOS, without transforming and
trimming the data arbitrarily.
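A toy illustration of why the median is attractive for skewed stays. With no covariates, median regression reduces to the sample median, which minimizes the sum of absolute deviations. The stays below are made up, and the clustering handled in the paper is ignored here.

```python
import statistics

los_days = [1, 2, 2, 3, 3, 3, 4, 5, 7, 45]  # made-up stays with one long outlier

mean_los = statistics.mean(los_days)      # pulled up by the 45-day stay
median_los = statistics.median(los_days)  # unaffected by how long the tail is

def abs_loss(center):
    """Sum of absolute deviations: the criterion median regression minimizes."""
    return sum(abs(x - center) for x in los_days)

def sq_loss(center):
    """Sum of squared deviations: the criterion mean regression minimizes."""
    return sum((x - center) ** 2 for x in los_days)
```

Comparing `abs_loss(median_los)` with `abs_loss(mean_los)`, and `sq_loss` the other way round, shows each summary winning under its own criterion.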
Continue reading "Hospital Length of Stay: Mean or Median Regression" »
R Graphics by Paul Murrell shipped.
Previously announced.
Kernel density estimation for multivariate data is an important
technique that has a wide range of applications in econometrics and
finance. Its limited use in practice is mainly due to the increased
difficulty in deriving an optimal data-driven bandwidth as the
dimension of data increases. We provide Markov chain Monte Carlo
(MCMC) algorithms for estimating optimal bandwidth matrices for
multivariate kernel density estimation.
Our approach is based on treating the elements of the bandwidth matrix
as parameters whose posterior density can be obtained through the
likelihood cross-validation criterion. Numerical studies for bivariate
data show that the MCMC algorithm generally performs better than the
plug-in algorithm under the Kullback-Leibler information criterion.
Numerical studies for five dimensional data show that our algorithm is
superior to the normal reference rule.
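The likelihood cross-validation criterion can be sketched in one dimension as below. Note the swaps: this uses a single scalar bandwidth, a Gaussian kernel, and a plain grid search in place of the paper's MCMC over a full bandwidth matrix. The data and the grid are made up.

```python
import math

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian kernel density estimate."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        dens = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2)
                   for j, xj in enumerate(data) if j != i)
        dens /= (n - 1) * h * math.sqrt(2 * math.pi)
        total += math.log(max(dens, 1e-300))  # floor avoids log(0) for tiny h
    return total

# Made-up sample and candidate bandwidths 0.05, 0.10, ..., 2.00.
data = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.4, 0.9, 1.1, 1.6]
grid = [0.05 * k for k in range(1, 41)]
best_h = max(grid, key=lambda h: loo_log_likelihood(data, h))
```

In the Bayesian version described above, the same criterion plays the role of the likelihood, and the bandwidth elements are sampled rather than searched on a grid.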
Continue reading "MCMC method bandwidth selection for multivariate kernel density estimation" »
This paper explores prediction in time series in which the data is
generated by a curve-valued autoregression process. It develops a
novel technique, the predictive factor decomposition, for estimation
of the autoregression operator, which is designed to be better suited
for prediction purposes than the principal components method.
The technique is based on finding a reduced-rank approximation to the
autoregression operator that minimizes the norm of the expected
prediction error. The new method is illustrated by an analysis of the
dynamics of Eurodollar futures rates term structure. We restrict the
sample to the period of normal growth and find that in this subsample
the predictive factor technique not only outperforms the principal
components method but also performs on par with the best available
prediction methods.
Curve Forecasting by Functional Autoregression
Presenter(s) Alexei Onatski, Columbia University
Co-Author(s) Vladislav Kargin, Cornerstone Research
Session Chair James Stock, Harvard University
Continue reading "Curve Forecasting by Functional Autoregression" »
Functional data analysis (FDA) handles longitudinal data and treats
each observation as a function of time (or other variable). The
functions are related. The goal is to analyze a sample of functions
instead of a sample of related points.
FDA differs from traditional data analytic techniques in a number of
ways. Functions can be evaluated at any point in their domain.
Derivatives and integrals, which may provide better information (e.g.
graphical) than the original data, are easily computed and used in
multivariate and other functional analytic methods.
S+Functional Data Analysis User's Guide
by Douglas B. Clarkson, Chris Fraley, Charles C. Gu, James O. Ramsay
Functional Data Analysis (Springer Series in Statistics) (Hardcover)
by J. Ramsay, B. W. Silverman
Covers topics of linear models, principal components, canonical
correlation, and principal differential analysis in function spaces.
Applied Functional Data Analysis (Paperback)
by J.O. Ramsay, B.W. Silverman
Bernard W. Silverman's code site Applied Functional Data Analysis: Methods and Case Studies
Mathematical Statistics with MATHEMATICA,
Colin Rose, Murray D. Smith (Hardcover)
The mathStatica software, an add-on to Mathematica, provides a
toolset specially designed for doing mathematical statistics. It
enables students to solve difficult problems by removing the technical
calculations often associated with mathematical statistics. The
professional statistician will be able to tackle tricky multivariate
distributions, generating functions, inversion theorems, symbolic
maximum likelihood estimation, unbiased estimation, and the checking
and correcting of textbook formulas. This text would be a useful
companion for researchers and students in statistics, econometrics,
engineering, physics, psychometrics, economics, finance, biometrics,
and the social sciences.
Companion site mathStatica.com
Information Visualisation Lecture Slides use R.
Default models and asset pricing models at Enrico De Giorgi's resource,
some with correlated defaults.
Some PROC QUANTREG features are:
* Implements the simplex, interior point, and smoothing algorithms for
estimation
* Provides three methods to compute confidence intervals for the
regression quantile parameter: sparsity, rank, and resampling.
* Provides two methods to compute the covariance and correlation
matrices of the estimated parameters: an asymptotic method and a
bootstrap method
* Provides two tests for the regression parameter estimates: the Wald
test and a likelihood ratio test
* Uses robust multivariate location and scale estimates for leverage
point detection
* Multithreaded for parallel computing when multiple processors are
available
To COMPACTLY store and SPEEDILY manipulate the large
N-dimensional data sets which are the bread and butter
of scientific computing. e.g. $a=$b+$c can add two
2048x2048 images in only a fraction of a second.
Perl Data Language (PDL), PDL::Impatient - PDL for the impatient
A PDL scalar variable (an instance of a particular class of
perl object, i.e. blessed thingie) is a piddle.
What have we learnt ? State of stats: PDF, Antony Unwin on Statistical Learning.
Global criteria: – AIC, BIC, deviance, test error,...
Local criteria: – residuals, diagnostics
An SVM corresponds to a linear method in a very high dimensional feature
space which is nonlinearly related to the input space. It does not
involve any computations in that high dimensional space. By the use of
kernels, all necessary computations are performed directly in input space.
Support vector machines (SVMs) are a method for creating functions from a set of labeled training
data. The function can be a classification function (the output is
binary: is the input in a category) or the function can be a general
regression function.
For classification, SVMs operate by finding a hypersurface in the
space of possible inputs. This hypersurface will attempt to split the
positive examples from the negative examples. The split will be chosen
to have the largest distance from the hypersurface to the nearest of
the positive and negative examples. Intuitively, this makes the
classification correct for testing data that is near, but not
identical to the training data.
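The point about kernels can be checked numerically on a tiny case. The sketch below uses the degree-2 polynomial kernel on 2-D inputs, chosen here only because its feature map is small enough to write out; real SVM feature spaces are typically far larger.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
k_input = poly2_kernel(x, z)     # no feature vectors are ever built
k_feature = dot(phi(x), phi(z))  # same number, via the 3-D feature space
```

Both routes give the same value, which is why the SVM never has to compute anything in the feature space itself.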
R (with package e1071):
estimate, predict, example, example2.
Matlab:
Kernel Methods for Pattern Analysis
John Shawe-Taylor & Nello Cristianini
Cambridge University Press, 2004
Detailed contents, inventory of algorithms and kernels, and matlab code.

Stand-alone:
SVM Light is an implementation of a Support Vector Machine.
SGTlight is an implementation of a Spectral Graph Transducer (SGT)
[Joachims, 2003] in C using Matlab libraries. The SGT is a method for
transductive learning. It solves a normalized-cut (or ratio-cut) problem
with additional constraints for the labeled examples using spectral
methods. The approach is efficient enough to handle datasets with
several ten-thousands of examples.
Automatic pattern analysis of data is a pillar of modern science,
technology and business, with deep roots in statistics, machine
learning, pattern recognition, theoretical computer science, and many
other fields. A unified conceptual understanding of this strategic
field is of utmost importance for researchers as well as for users of
this technology.
This workshop - course emphasizes the common principles and roots
of modern pattern analysis technology, developed independently by many
different scientific communities over the past 30 years, and their
impact on modern science and technology.
Students and researchers from many disciplines dealing with automatic
pattern analysis form the intended audience. These include (but are
not limited to) statistics, pattern recognition, data mining, machine
learning, information theory, sequence analysis, bioinformatics,
adaptive systems, etc.
Italy, October 28 - November 6, 2005
Fair Isaac and UCSD data mining competition lets you test your predictive power.
Kalman filter (An algorithm in control theory introduced by R. Kalman in 1960 and
refined by Kalman and R. Bucy. It is an algorithm which makes optimal use of imprecise
data on a linear (or nearly linear) system with Gaussian errors to continuously update
the best estimate of the system's current state.)
As a time series function (example); as an estimator for linear
(time series and panel) models with time-varying coefficients.
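A minimal scalar Kalman filter, as a sketch of the predict/update cycle. It assumes a random-walk state observed with Gaussian noise; the variances, starting values, and measurements are made up.

```python
def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """State: x_t = x_{t-1} + w, Var(w) = q.  Observation: z_t = x_t + v, Var(v) = r."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: uncertainty grows by the process noise
        k = p / (p + r)        # Kalman gain: how much to trust the new measurement
        x = x + k * (z - x)    # update the estimate toward the measurement
        p = (1 - k) * p        # uncertainty shrinks after the update
        estimates.append(x)
    return estimates

# Made-up noisy readings of a quantity near 10, with a vague starting guess.
readings = [9.6, 10.4, 10.1, 9.8, 10.3, 9.9]
estimates = kalman_1d(readings, q=1e-4, r=0.25, x0=0.0, p0=100.0)
```

The time-varying-coefficient regression use mentioned above treats the coefficients as the state vector and uses matrix versions of the same four steps.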
Decision Science News by Dan Goldstein and Kevin Flora
about the decision sciences including but not limited to Psychology,
Economics, Business, Medicine, and Law, but
mostly marketing.
Also on Wilmott.
Statistical Modeling, Causal Inference, and Social Science (MLM)
Andrew Gelman and Samantha Cook at Columbia.
XLISP-Stat tools for building Generalised Estimating Equation models
offers an introduction to GEE models.
Much of the brain trust of XLISP-Stat has moved on to R.
Continue reading "XLISP-Stat estimates Generalised Estimating Equations" »
Update 2005 Sept 03: R Graphics is shipping !
A book on the core graphics facilities of the R language and
environment for statistical computing and graphics (to be published
by Chapman & Hall/CRC in August 2005). Preview now.
Wavelets are mathematical expansions that transform data from the
time domain into different layers of frequency levels. Compared to
standard Fourier analysis, they have the advantage [PDF] of being
localized both in time and in the frequency domain, and enable the
researcher to observe and analyze data at different scales.
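One level of the Haar transform, the simplest wavelet, gives a feel for the time localization. This is a sketch for even-length signals only, and the signal is made up.

```python
import math

def haar_step(signal):
    """Split an even-length signal into coarse averages and local details."""
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the original signal from one level of coefficients."""
    s = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / s, (a - d) / s]
    return out

signal = [4, 4, 4, 4, 4, 10, 4, 4]   # flat except for one spike
approx, detail = haar_step(signal)
rebuilt = haar_inverse(approx, detail)
```

Only the detail coefficient covering the spike is nonzero, so the transform says where the jump happened as well as how big it was. Applying `haar_step` again to `approx` gives the next, coarser scale.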
Probability of Default (PD)
- the probability that a specific customer will default
within the next 12 months.
Loss Given Default (LGD)
- the percentage of each credit facility that will be lost
if the customer defaults.
Exposure at Default (EAD)
- the expected exposure for each credit facility in the
event of a default.
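These three parameters are usually combined into an expected loss per facility, EL = PD x LGD x EAD. That formula is the standard Basel-style one rather than something stated in the definitions above, and the facilities below are made up.

```python
def expected_loss(pd, lgd, ead):
    """Expected loss over the PD horizon: PD x LGD x EAD."""
    return pd * lgd * ead

# Made-up facilities for one customer (same PD, different LGD and EAD).
facilities = [
    {"name": "term loan", "pd": 0.02, "lgd": 0.45, "ead": 1_000_000},
    {"name": "credit line", "pd": 0.02, "lgd": 0.60, "ead": 250_000},
]
losses = {f["name"]: expected_loss(f["pd"], f["lgd"], f["ead"]) for f in facilities}
total_expected_loss = sum(losses.values())
```

PD attaches to the customer while LGD and EAD attach to each facility, which is why the loop runs over facilities.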
Surveys on the use of agency credit ratings reveal that some
investors believe that rating agencies are relatively slow in
adjusting their ratings. A well-accepted explanation for this
perception on the timeliness of ratings is the "through-the-cycle"
methodology that agencies use. According to Moody's, through-the-cycle
ratings are stable because they are intended to measure
default risk over long investment horizons, and because they are
changed only when agencies are confident that observed changes in a
company's risk profile are likely to be permanent. To verify this
explanation, we quantify the impact of the long-term default horizon
and the prudent migration policy on rating stability from the
perspective of an investor - with no desire for rating stability. This
is done by benchmarking agency ratings with a financial ratio-based
(credit scoring) agency-rating prediction model and (credit scoring)
default-prediction models of various time horizons. We also examine
rating migration practices. The final result is a better quantitative
understanding of the through-the-cycle methodology.
By varying the time horizon in the estimation of default-prediction
models, we search for a best match with the agency-rating prediction
model. Consistent with the agencies' stated objectives, we conclude
that agency ratings are focused on the long term. In contrast to
one-year default prediction models, agency ratings place less weight
on short-term indicators of credit quality.
We also demonstrate that the focus of agencies on long investment
horizons explains only part of the relative stability of agency
ratings. The other aspect of through-the-cycle rating methodology -
agency rating-migration policy - is an even more important factor
underlying the stability of agency ratings. We find that rating
migrations are triggered when the difference between the actual agency
rating and the model predicted rating exceeds a certain threshold
level. When rating migrations are triggered, agencies adjust their
ratings only partially, consistent with the known serial dependency of
agency rating migrations.
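The migration rule described in the last two sentences can be sketched as a threshold plus partial adjustment. The numeric notch scale, the 1.5-notch threshold, and the 60% adjustment fraction are hypothetical, and real ratings move in whole notches.

```python
def update_rating(agency, model, threshold=1.5, adjustment=0.6):
    """Move the agency rating only when the gap to the model rating is large,
    and then close only part of the gap."""
    gap = model - agency
    if abs(gap) <= threshold:
        return agency                   # inside the band: no migration
    return agency + adjustment * gap    # triggered: partial adjustment

# Made-up path of model-predicted ratings on a numeric scale.
model_path = [10.5, 11.0, 12.0, 13.0, 13.0]
agency = 10.0
agency_path = []
for model in model_path:
    agency = update_rating(agency, model)
    agency_path.append(agency)
```

Because each triggered move closes only part of the gap, one migration tends to be followed by another in the same direction, which matches the serial dependency noted above.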
Continue reading "How Ratings Agencies Achieve Rating Stability" »
TreeAge offers statistical software for non-statisticians.
Features include sensitivity analysis and distribution graphs.
Belief networks (also known as Bayesian networks, Bayes networks and
causal probabilistic networks), provide a method to represent
relationships between propositions or variables, even if the
relationships involve uncertainty, unpredictability or imprecision.
They may be learned automatically from data files, created by an
expert, or developed by a combination of the two. They capture
knowledge in a modular form that can be transported from one situation
to another; it is a form people can understand, and which allows a
clear visualization of the relationships involved.
By adding decision variables (things that can be controlled), and
utility variables (things we want to optimize) to the relationships of
a belief network, a decision network (also known as an influence
diagram) is formed. This can be used to find optimal decisions,
control systems, or plans.
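A two-node belief network with one decision and one utility table is about the smallest decision network possible. The sketch below does the inference by direct enumeration; all probabilities and utilities are made up.

```python
# Belief network: Disease -> Test result.
P_DISEASE = 0.01
P_POSITIVE_GIVEN = {True: 0.95, False: 0.05}  # P(test positive | disease state)

def posterior_disease_given_positive():
    """P(disease | positive test), by Bayes' rule over the two disease states."""
    joint_true = P_DISEASE * P_POSITIVE_GIVEN[True]
    joint_false = (1 - P_DISEASE) * P_POSITIVE_GIVEN[False]
    return joint_true / (joint_true + joint_false)

# Decision node (treat or wait) and utility node, keyed by (action, disease state).
UTILITY = {("treat", True): -20, ("treat", False): -5,
           ("wait", True): -100, ("wait", False): 0}

def expected_utility(action, p_disease):
    return (p_disease * UTILITY[(action, True)]
            + (1 - p_disease) * UTILITY[(action, False)])

p = posterior_disease_given_positive()
best_action = max(("treat", "wait"), key=lambda a: expected_utility(a, p))
```

Dedicated packages do the same thing with more nodes and with smarter algorithms than enumeration.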

Agena Risk bayesian network analysis software and whitepapers.
Abstract: We propose a Bayesian methodology that enables banks with
small datasets to improve their default probability estimates by
imposing prior information. As prior information, we use coefficients
from credit scoring models estimated on other datasets. Through
simulations, we explore the default prediction power of three Bayesian
estimators in three different scenarios and find that all three
perform better than standard maximum likelihood estimates. We
therefore recommend that banks consider Bayesian estimation for
internal and regulatory default prediction models.
Keywords: Credit Ratings, Rating Agency, Bayesian Inference, Basel II
JEL Classification: C11, G21, G33
Continue reading "Bayesian Methods for Improving Credit Scoring Models" »
ROC.
The ability of a test to discriminate diseased cases from normal cases
is evaluated using Receiver Operating Characteristic (ROC) curve
analysis (Metz, 1978; Zweig & Campbell, 1993). ROC curves can also be
used to compare the diagnostic performance of two or more laboratory or
diagnostic tests (Griner et al., 1981).
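A small sketch of an ROC curve and its area. The area is computed two ways: as the share of diseased/normal pairs the test ranks correctly, and by the trapezoid rule on the curve. The scores are made up, with higher scores taken to mean more likely diseased.

```python
def auc_by_pairs(diseased, normal):
    """Share of (diseased, normal) pairs ranked correctly; ties count half."""
    wins = 0.0
    for d in diseased:
        for n in normal:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(diseased) * len(normal))

def roc_points(diseased, normal):
    """(false positive rate, true positive rate) at each possible cutoff."""
    points = [(0.0, 0.0)]
    for t in sorted(set(diseased + normal), reverse=True):
        tpr = sum(d >= t for d in diseased) / len(diseased)
        fpr = sum(n >= t for n in normal) / len(normal)
        points.append((fpr, tpr))
    return points

def auc_by_trapezoid(points):
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

diseased = [0.9, 0.8, 0.6, 0.55]
normal = [0.7, 0.5, 0.4, 0.3]
auc_pairs = auc_by_pairs(diseased, normal)
auc_curve = auc_by_trapezoid(roc_points(diseased, normal))
```

Comparing two tests on the same cases amounts to comparing their two areas.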
TreeBoost - Stochastic Gradient Boosting.
"Boosting" is a technique for improving the accuracy of a predictive
function by applying the function repeatedly in a series and combining
the output of each function with weighting so that the total error of
the prediction is minimized. In many cases, the predictive accuracy of
such a series greatly exceeds the accuracy of the base function used
alone.
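A bare-bones sketch of boosting for regression, with one-split "stumps" as the base function. It is plain gradient boosting under squared error, without the random subsampling that makes TreeBoost stochastic. The learning rate and data are made up.

```python
def fit_stump(xs, residuals):
    """Best single-split, piecewise-constant fit to the residuals."""
    best = None
    for split in sorted(set(xs))[1:]:   # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x < split else rm

def boost(xs, ys, rounds, rate=0.5):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]              # made-up curved target
one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=20)
```

Comparing `mse(one_round, xs, ys)` with `mse(many_rounds, xs, ys)` shows the series fitting the training data far better than a single stump; accuracy on new data is a separate question.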
Correlation monger provides pair-wise correlation of
demographic variables across 50 US states. For example,
Canadians increase property values.
MedCalc has a good list of basic statistical features.
# Stepwise Multiple regression
# Stepwise Logistic regression
# Paired and unpaired t-tests
# Rank sum tests: Wilcoxon test (paired data), Mann-Whitney U test (unpaired data)
# Variance ratio test (F-test)
# One-way analysis of variance (ANOVA) with Student-Newman-Keuls (SNK) test for pairwise comparison of subgroups
# Two-way analysis of variance
# Kruskal-Wallis test
# Frequencies table, crosstabulation analysis, Chi-square test, Chi-square test for trend
# Tests on 2x2 tables: Fisher's exact test, McNemar test
# Frequencies bar charts
# Kaplan-Meier survival curve, logrank test for comparison of survival curves, hazard ratio, logrank test for trend
# Cox proportional-hazards regression
# Meta-analysis: odds ratio (random effects or fixed effects model - Mantel-Haenszel method); summary effects for continuous outcomes; Forest plot
# Reference interval (normal range)
# Analysis of Serial measurements with group comparison
# Bland & Altman plot for method comparison (bias plot) - repeatability