## January 29, 2017

### Minimum SAT score for college admission: varies by race ?

A 2009 Princeton study showed Asian-Americans had to score 140 points higher on their SATs than whites, 270 points higher than Hispanics and 450 points higher than blacks to have the same chance of admission to leading universities.

A lawsuit filed in 2014 accused Harvard of having a cap on the number of Asian students -- the percentage of Asians in Harvard's student body had remained about 16 percent to 19 percent for two decades even though the Asian-American percentage of the population had more than doubled. In 2016, the Asian American Coalition for Education filed a complaint with the Department of Education against Yale, where the Asian percentage had remained 13 percent to 16 percent for 20 years, as well as Brown and Dartmouth, urging investigation of their admissions practices for similar reasons.

## January 26, 2017

### Cambridge Analytica's psychographic profiling for behavioral microtargeting in elections

Understand personality, not just demographics. OCEAN model: Openness, Conscientiousness, Extroversion, Agreeableness, Neuroticism.

In a 10-minute presentation at the 2016 Concordia Summit, Alexander Nix discusses the power of big data in global elections. Cambridge Analytica's approach to audience targeting, data modeling, and psychographic profiling has made it a leader in behavioral microtargeting for election processes around the world.

Cambridge's voter data innovations are built from a traditional five-factor model for gauging personality traits. The company uses ongoing nationwide survey data to evaluate voters in specific regions according to the OCEAN or CANOE factors of openness, conscientiousness, extroversion, agreeableness and neuroticism. The ultimate political application of the modeling system is to craft specific ad messages tailored to voter segments based on how they fall on the five-factor spectrum.
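The segmentation step described above can be sketched in a few lines. This is an invented illustration, not Cambridge Analytica's actual model: the trait scores, the segment-to-message table, and the voter record are all hypothetical.

```python
# Hypothetical sketch: assign each voter to the OCEAN trait they score
# highest on, then key an ad message to that trait. Scores and messages
# are invented for illustration.

MESSAGES = {
    "openness": "Change and new ideas",
    "conscientiousness": "Tradition and responsibility",
    "extroversion": "Community and events",
    "agreeableness": "Family and harmony",
    "neuroticism": "Safety and security",
}

def dominant_trait(scores: dict) -> str:
    """Return the OCEAN trait with the highest survey-derived score."""
    return max(scores, key=scores.get)

voter = {"openness": 0.2, "conscientiousness": 0.9,
         "extroversion": 0.4, "agreeableness": 0.6, "neuroticism": 0.3}
segment = dominant_trait(voter)
print(segment, "->", MESSAGES[segment])
```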

The number-crunching and analytics for Mr. Trump felt more like a "data experiment," said Matthew Oczkowski, head of product at Cambridge Analytica, who led the team for nearly six months.

## May 26, 2016

### Party to violence predicted

The Chicago police, which began creating the Strategic Subject List a few years ago, said they viewed it as in keeping with findings by Andrew Papachristos, a sociologist at Yale, who said that the city's homicides were concentrated within a relatively small number of social networks that represent a fraction of the population in high-crime neighborhoods.

Miles Wernick, a professor at the Illinois Institute of Technology, created the algorithm. It draws, the police say, on variables tied to a person's past behavior, particularly arrests and convictions, to predict who is most likely to become a "party to violence."

The police cited proprietary technology as the reason they would not make public the 10 variables used to create the list, but they said that some examples were questions like: Have you been shot before? Is your "trend line" for crimes increasing or decreasing? Do you have an arrest for weapons?

Dr. Wernick said the model intentionally avoided using variables that could discriminate in some way, like race, gender, ethnicity and geography.

Jonathan H. Lewin, the deputy chief of the Chicago Police Department's technology and records group, said: "This is not designed to replace the human process. This is just designed to inform it."
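The general shape of such a model can be sketched as a weighted score over past-behavior inputs. This is illustrative only: the actual Strategic Subject List algorithm and its ten variables are not public, and the variable names and weights below are invented, keeping only the article's constraint that race, gender, ethnicity and geography are excluded.

```python
# Illustrative only -- the real model is proprietary. A score built from
# past-behavior variables like those the article names; weights invented.

def risk_score(shot_before: bool, weapons_arrest: bool,
               crime_trend_increasing: bool) -> float:
    """Higher score = flagged as more likely 'party to violence'."""
    weights = {"shot_before": 0.5, "weapons_arrest": 0.3,
               "crime_trend_increasing": 0.2}
    return (weights["shot_before"] * shot_before
            + weights["weapons_arrest"] * weapons_arrest
            + weights["crime_trend_increasing"] * crime_trend_increasing)

print(risk_score(shot_before=True, weapons_arrest=False,
                 crime_trend_increasing=True))
```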

## May 13, 2016

### Evaluating men and women on different traits, Rate My Professor

Benjamin Schmidt, a professor at Northeastern University, created a searchable database of roughly 14 million reviews from the Rate My Professor site.

Among the words more likely to be used to describe men: smart, idiot, interesting, boring, cool, creepy. And for women: sweet, shrill, warm, cold, beautiful, evil. "Funny" and "corny" were also used more often to describe men, while "organized" and "disorganized" showed up more for women.

In short, Schmidt says, men are more likely to be judged on an intelligence scale, while women are more likely to be judged on a nurturing scale.

"We're evaluating men and women on different traits or having different expectations for individuals who are doing the same job," says Erin Davis, who teaches gender studies at Cornell College.
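The kind of comparison behind Schmidt's tool can be approximated by comparing per-gender word rates. The review counts below are fabricated for illustration; the method is just a normalized log-ratio, not Schmidt's exact pipeline.

```python
# Sketch: how much more often does a word appear in reviews of men vs.
# women, normalized by total words per group? Counts are fabricated.

from math import log

counts_men = {"smart": 120, "funny": 90, "sweet": 20, "organized": 15}
counts_women = {"smart": 60, "funny": 45, "sweet": 80, "organized": 50}
total_men = sum(counts_men.values())
total_women = sum(counts_women.values())

def log_odds(word: str) -> float:
    """Positive -> used more for men; negative -> used more for women."""
    rate_m = counts_men[word] / total_men
    rate_w = counts_women[word] / total_women
    return log(rate_m / rate_w)

for word in counts_men:
    print(f"{word:10s} {log_odds(word):+.2f}")
```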

## May 5, 2016

### Ranking or classifying adjacent words

I came across WordRank -- a fresh new approach to embedding words by looking at it as a ranking problem. In hindsight, this makes sense. In a typical language modeling situation, NN-based or otherwise, we are interested in this: you have a context c, and you want to predict which word ŵ from your vocabulary Σ will follow it. Naturally, this can be set up either as a ranking problem or a classification problem. If you are coming from the learning-to-rank camp, all sorts of bells might be going off at this point, and you might have several good reasons for favoring the ranking formulation. That's exactly what we see in this paper. By setting up word embedding as a ranking problem, you get a discriminative training regimen and a built-in attention-like capability (more on that later).

-- Summary by Delip Rao.
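The ranking formulation can be illustrated with a toy numpy sketch. This is a generic pairwise hinge-ranking setup, not WordRank's actual objective; the embeddings, dimensions, and margin are arbitrary.

```python
# Not WordRank's exact loss -- a minimal illustration of framing
# "which word follows context c" as ranking: score every vocabulary
# word against the context, then penalize words ranked above the
# true next word via a margin-based pairwise hinge loss.

import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 8, 5
W = rng.normal(size=(vocab, dim))   # one embedding per vocabulary word
c = rng.normal(size=dim)            # context embedding
true_word = 2

scores = W @ c                      # one score per vocabulary word
# rank of the true word = number of words scoring at least as high
rank = int(np.sum(scores >= scores[true_word]))
# hinge loss against every other word, margin 1.0
others = np.delete(scores, true_word)
loss = float(np.sum(np.maximum(0.0, 1.0 - scores[true_word] + others)))
print("rank:", rank, "loss:", round(loss, 3))
```

Minimizing this loss pushes the true word's score above every competitor's by a margin, which is the discriminative flavor the summary refers to.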

## April 15, 2016

At first I was pretty proud of myself for messing with Facebook's algorithms. But after a little reflection I couldn't escape the feeling I hadn't really gamed anything. I'd created a joke that a lot of people enjoyed. They signaled their enjoyment, which gave Facebook the confidence to show the enjoyable joke to more people. There was nothing "incorrect" about that fake news being at the top of people's feeds. The system--in its murky recursive glory--did what it was supposed to do. And on the next earnings call Mark Zuckerberg can still boast high user engagement numbers.

## April 7, 2016

### Data USA government data

Hal R. Varian, chief economist of Google, who has no connection to Data USA, called the site "very informative and aesthetically pleasing." The fact the government is making so much data publicly available, he added, is fueling creative work like Data USA.

Data USA embodies an approach to data analysis that will most likely become increasingly common, said Kris Hammond, a computer science professor at Northwestern University. The site makes assumptions about its users and programs those assumptions into its software, he said.

"It is driven by the idea that we can actually figure out what a user is going to want to know when they are looking at a data set," Mr. Hammond said.

Data scientists, he said, often bristle when such limitations are put into the tools they use. But they are the data world's power users, and power users are a limited market, said Mr. Hammond, who is also chief scientist at Narrative Science, a start-up that makes software to interpret data.

## March 24, 2016

### Economists need data

Angus Deaton, this year's winner of the Nobel in economic science, was honored for his rigorous and innovative use of data -- including the collection and use of new surveys on individuals' choices and behaviors -- to measure living standards and guide policy.

House Republicans, for example, have been especially scornful of the decennial census, the nation's most important statistical tool, and the related questionnaire, the American Community Survey. They have placed prohibitive constraints on the Census Bureau, including a mandate that it spend no more on the 2020 census than it spent on the 2010 census, despite inflation, population growth and technological change.

## February 19, 2016

### Delegate math vs Cruz 16 or Hillary 08

Leading proportional states but trailing in winner-take-all states does not add up to victory.

The delegate allocation matrix puts Cruz's campaign at a serious disadvantage. For example, if Cruz wins the primary in his home state of Texas by one vote, he'll probably win a handful more delegates than his nearest competitor. By contrast, if Marco Rubio or Trump wins Florida by one vote, either would win a whopping 99 more delegates than his nearest competitor.

-- David Wasserman, U.S. House editor for the Cook Political Report, via 538.
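The contrast can be checked with back-of-envelope arithmetic. The `delegates_won` function is a deliberate simplification (real proportional rules involve thresholds and district-level allocation), and the 155-delegate Texas total is an assumption added here; Florida's 99 winner-take-all delegates come from the quote above.

```python
# One-vote win under proportional rules vs. winner-take-all, in a
# simplified two-candidate race. Illustrative only.

def delegates_won(total: int, vote_share: float, winner_take_all: bool) -> int:
    if winner_take_all:
        return total if vote_share > 0.5 else 0
    return round(total * vote_share)  # crude proportional allocation

# Texas-style proportional: a narrow win yields a narrow delegate edge
tx = (delegates_won(155, 0.51, winner_take_all=False)
      - delegates_won(155, 0.49, winner_take_all=False))
# Florida winner-take-all: the same narrow win yields all 99 delegates
fl = (delegates_won(99, 0.51, winner_take_all=True)
      - delegates_won(99, 0.49, winner_take_all=True))
print(tx, fl)
```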

## January 12, 2016

### Sharp divergence between pay at the most successful companies and also-rans in the same field

Bloom believes inequality is being magnified by technological change and what's known as skills bias, where workers with a particular expertise reap the biggest reward. Neither is amenable to quick fixes.

In Professor Bloom's new paper, which he wrote with David J. Price, a Stanford graduate student, and three other economists -- Jae Song, Fatih Guvenen and Till von Wachter -- the top quarter of 1 percent of Americans appears to be pulling away from the rest.

### Automated essay scoring

The essay-scoring competition that just concluded offered a mere $60,000 as a first prize, but it drew 159 teams. At the same time, the Hewlett Foundation sponsored a study of automated essay-scoring engines now offered by commercial vendors. The researchers found that these produced scores effectively identical to those of human graders. Barbara Chow, education program director at the Hewlett Foundation, says: "We had heard the claim that the machine algorithms are as good as human graders, but we wanted to create a neutral and fair platform to assess the various claims of the vendors. It turns out the claims are not hype."

## July 3, 2012

### Employment discrimination provisions

An employer's evidence of a racially balanced workforce will not be enough to disprove disparate impact.

Employment discrimination provisions of the act apply to companies with more than 15 employees and define two broad types of discrimination: disparate treatment and disparate impact. Disparate treatment is fairly straightforward: it is illegal to treat someone differently on the basis of race or national origin. For example, an employer cannot refuse to hire an African-American with a criminal conviction but hire a similarly situated white person with a comparable conviction.

Disparate impact is more complicated. It essentially means that practices that disproportionately harm racial or ethnic groups protected by the law can be considered discriminatory even if there is no obvious intent to discriminate. In fact, according to the guidance, "evidence of a racially balanced work force will not be enough to disprove disparate impact."

EEOC: 1, 2.

## July 1, 2012

### Acxiom, data refinery

A bank that wants to sell its best customers additional services, for example, might buy details about those customers' social media, Web and mobile habits to identify more efficient ways to market to them. Or, says Mr. Frankland at Forrester, a sporting goods chain whose best customers are 25- to 34-year-old men living near mountains or beaches could buy a list of a million other people with the same characteristics. The retailer could hire Acxiom, he says, to manage a campaign aimed at that new group, testing how factors like consumers' locations or sports preferences affect responses.

But the catalog also offers delicate information that has set off alarm bells among some privacy advocates, who worry about the potential for misuse by third parties that could take aim at vulnerable groups. Such information includes consumers' interests -- derived, the catalog says, "from actual purchases and self-reported surveys" -- like "Christian families," "Dieting/Weight Loss," "Gaming-Casino," "Money Seekers" and "Smoking/Tobacco." Acxiom also sells data about an individual's race, ethnicity and country of origin. "Our Race model," the catalog says, "provides information on the major racial category: Caucasians, Hispanics, African-Americans, or Asians." Competing companies sell similar data.

Acxiom's data about race or ethnicity is "used for engaging those communities for marketing purposes," said Ms. Barrett Glasgow, the privacy officer, in an e-mail response to questions. There may be a legitimate commercial need for some businesses, like ethnic restaurants, to know the race or ethnicity of consumers, says Joel R. Reidenberg, a privacy expert and a professor at the Fordham Law School. "At the same time, this is ethnic profiling," he says. "The people on this list, they are being sold based on their ethnic stereotypes. There is a very strong citizen's right to have a veto over the commodification of their profile." He says the sale of such data is troubling because race coding may be incorrect. And even if a data broker has correct information, a person may not want to be marketed to based on race.

"Do you really know your customers?" Acxiom asks in marketing materials for its shopper recognition system, a program that uses ZIP codes to help retailers confirm consumers' identities -- without asking their permission. "Simply asking for name and address information poses many challenges: transcription errors, increased checkout time and, worse yet, losing customers who feel that you're invading their privacy," Acxiom's fact sheet explains. In its system, a store clerk need only "capture the shopper's name from a check or third-party credit card at the point of sale and then ask for the shopper's ZIP code or telephone number." With that data Acxiom can identify shoppers within a 10 percent margin of error, it says, enabling stores to reward their best customers with special offers. Other companies offer similar services. "This is a direct way of circumventing people's concerns about privacy," says Mr. Chester of the Center for Digital Democracy.

## June 9, 2012

### SAS financial services modeling

The SAS financial services modeling group in San Diego is exploring ways to take advantage of high-performance analytics and big data techniques to deliver more models, more quickly, and to more customers. This wasn't a purely academic exercise -- the team in San Diego has added several new customers recently and has been looking for ways to boost productivity, so this is the perfect setup for our high-performance story. Perhaps you've seen Jim Davis' blog where he ponders what you can do with all the extra time savings that high-performance analytics offers ... providing service to more customers is one good idea!

## June 4, 2012

### Fab, post Fabulis, is data driven

Custora, which also works with sites like Etsy and Revolve Clothing, creates similar online dashboards. But its specialty is identifying the most valuable customer segments and using algorithms to forecast their potential spending over time. Right now, for example, only 15 percent of Fab.com purchasers shop with the company's iPad app. But a Custora forecast estimated that, over the next two years, a typical iPad customer would spend twice as much as a typical Web customer and that the iPad cohort would generate more than 25 percent of Fab.com's revenue.

In an era of online behavioral tracking, Fab.com has been more transparent than some other sites about a lot of its customer surveillance, data collection and analysis. Mr. Goldberg writes regularly about the company's social marketing practices and metrics on his blog. Likewise, when Fab.com was seeking seed money last year, Mr. Goldberg gave several venture capital firms passwords to the RJMetrics dashboard so they could see the company's revenue and customer trends for themselves. "V.C.'s could see it every day," he says. "They could come back and say, 'How did Fab do today?'" Last December, Fab raised $40 million from Andreessen Horowitz, Menlo Ventures, First Round Capital and several other sources, including the actor Ashton Kutcher. Mr. Goldberg, meanwhile, is now an investor in and a board member at RJMetrics.

This month, the site even re-engineered its look -- to Fab 3.0 (post Fabulis) -- to capitalize on recent data indicating that users who had checked out the site's crowd-sourcing feature were more likely to make purchases than those who had not. Among other updates, the site now gives more prominence to a live feed featuring the products that members have just bought or liked.

## May 14, 2012

### Quantified-Self

Footsteps, sweat, caffeine, memories, stress, even sex and dating habits - it can all be calculated and scored like a baseball batting average. And if there isn't already an app or a device for tracking it, one will probably appear in the next few years.

Over the last weekend of May, upstairs at the Computer History Museum in Mountain View, California, in the heart of Silicon Valley, 400 "Quantified-Selfers" from around the globe gathered to show off their Excel sheets, databases and gadgets.

-- April Dembosky, FT's San Francisco correspondent

## April 19, 2012

### Cybercrime: overcounted, and a tragedy of the commons

Most cybercrime estimates are based on surveys of consumers and companies. They borrow credibility from election polls, which we have learned to trust. However, when extrapolating from a surveyed group to the overall population, there is an enormous difference between preference questions (which are used in election polls) and numerical questions (as in cybercrime surveys).

For one thing, in numeric surveys, errors are almost always upward: since the amounts of estimated losses must be positive, there's no limit on the upside, but zero is a hard limit on the downside. As a consequence, respondent errors -- or outright lies -- cannot be canceled out. Even worse, errors get amplified when researchers scale between the survey group and the overall population.

Suppose we asked 5,000 people to report their cybercrime losses, which we will then extrapolate over a population of 200 million. Every dollar claimed gets multiplied by 40,000. A single individual who falsely claims $25,000 in losses adds a spurious$1 billion to the estimate. And since no one can claim negative losses, the error can't be canceled.
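The paragraph's arithmetic, spelled out:

```python
# Scaling a 5,000-person survey to a population of 200 million
# multiplies every reported dollar by 40,000 -- so a single $25,000
# false claim becomes $1 billion in the extrapolated estimate.

sample, population = 5_000, 200_000_000
multiplier = population / sample        # 40,000x
false_claim = 25_000                    # one fabricated loss report
spurious = false_claim * multiplier
print(f"{multiplier:,.0f}x -> ${spurious:,.0f} of spurious losses")
```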

## July 28, 2009

### IBM buys SPSS

I.B.M. took a big step to expand its fast-growing stable of data analysis offerings by agreeing on Tuesday to pay $1.2 billion to buy SPSS Inc., a maker of software used in statistical analysis and predictive modeling. Other independent analytics software makers may well become takeover targets, said Mr. Evelson of Forrester. Among the candidates, he said, are Accelrys, Applied Predictive Technologies, Genalytics, InforSense, KXEN and ThinkAnalytics.

The broad consolidation wave in business intelligence software, analysts say, will bring increasing price pressure on some segments of the industry as major companies seek to increase their share of the market. And the open-source programming language for data analysis, R, is another source of price pressure on software suppliers. "None of the consolidation purchases we've seen in the business intelligence industry have been fire sales," said Jim Davis, senior vice president of the SAS Institute, a private company based in Cary, N.C., that is the largest supplier of business intelligence and predictive analytics software.

## April 16, 2009

### Dennis the dentist, 3

The most astonishing change concerns the endings of boys' names. In 1880, most boys' names ended in the letters E, N, D and S. In 1956, the chart of final letters looked pretty much the same, with more names ending in Y. Today's chart looks nothing like the charts of the past century. In 2006, a huge (and I mean huge) percentage of boys' names ended in the letter N. Or as Wattenberg put it, "Ladies and gentlemen, that is a baby-naming revolution."

Wattenberg observes a new formality sweeping nursery schools. Thirty years ago there would have been a lot of Nicks, Toms and Bills on the playground. Now they are Nicholas, Thomas and William. In 1898, the name Dewey had its moment (you should be able to figure out why). Today, antique-sounding names are in vogue: Hannah, Abigail, Madeline, Caleb and Oliver. In the late 19th century, parents sometimes named their kids after prestigious jobs, like King, Lawyer, Author and Admiral. Now, children are more likely to bear the names of obsolete proletarian professions: Cooper, Carter, Tyler and Mason.

Wattenberg uses her blog to raise vital questions, such as: should you give your child an unusual name that is Googleable, or a conventional one that is harder to track? But what's most striking is the sheer variability of the trends she describes. Naming fashion doesn't just move a little. It swings back and forth. People who haven't spent a nanosecond thinking about the letter K get swept up in a social contagion and suddenly they've got a Keisha and a Kody. They may think they're making an individual statement, but in fact their choices are shaped by the networks around them.

Furthermore, if you just looked at names, you would conclude that American culture once had a definable core -- signified by all those Anglo names like Mary, Robert, John and William. But over the past few decades, that Anglo core is harder to find. In the world of niche naming, there is no clearly identifiable mainstream.

## April 6, 2009

### Dennis the dentist rules

Still, the couple, like many others, is vulnerable to falling behind again as home prices decline further. But Robert M. Lawless, a law professor at the University of Illinois who favors cram-downs, said success should not be viewed simply "in terms of dollars and cents."

-- Lawless law professor on cram-downs. Explanation of the Dennis the dentist rule.

## November 25, 2008

### Head hurt bayesian

They found that Web searches for things like headache and chest pain were just as likely or more likely to lead people to pages describing serious conditions as benign ones, even though the serious illnesses are much more rare. For example, there were just as many results that linked headaches with brain tumors as with caffeine withdrawal, although the chance of having a brain tumor is infinitesimally small. Would such inference be addressed better by a frequentist or bayesian mindset ?

## October 16, 2007

### Lost in lossy: compression loses information

## June 11, 2007

### Cash back at closing -- Mortgage fraud ?

First he built a dictionary of 150 keywords in real estate ads -- "creative financing," for instance -- that might signal a seller's willingness to play loose. He then looked for instances in which a house had languished on the market and yet wound up selling at or even above the final asking price. In such cases, he found that buyers typically paid a very small down payment; the smaller the down payment, in fact, the higher the price they paid for the house. What could this mean? Either the most highly leveraged buyers were terrible bargainers -- or, as Ben-David concluded, such anomalies indicated the artificial inflation that marked a cash-back deal.

Having isolated the suspicious transactions in the data, Ben-David could now examine the noteworthy traits they shared. He found that a small group of real estate agents were repeatedly involved, in particular when the seller was himself an agent or when there was no second agent in the deal. Ben-David also found that the suspect transactions were more likely to occur when the lending bank, rather than keeping the mortgage, bundled it up with thousands of others and sold them off as mortgage-backed securities. This suggests that the issuing banks treat suspect mortgages with roughly the same care as you might treat a rental car, knowing that you aren't responsible for its long-term outcome once it is out of your possession.

## June 1, 2007

### Epicurean Dealmaker

epicureandealmaker on fat tails. Derivatives: transferring risk or reducing risk ?
## May 30, 2007

### Interest-rate term-structure pricing models: Riccardo Rebonato

Review paper: "Interest-rate term-structure pricing models: a review," Riccardo Rebonato. Covers interest-rate term-structure modelling from the early short-rate-based models to the current developments, and the use of models for pricing complex derivatives or for relative-value option trading. Therefore, relative-pricing models are given a greater emphasis than equilibrium models. The paper argues that the current state of modelling owes a lot to how models have historically developed in the industry, and stresses the importance of 'technological' developments (such as faster computers or more efficient Monte Carlo techniques) in guiding the direction of theoretical research. The importance of the joint practices of vega hedging and daily model recalibration is analysed in detail. The relevance of market incompleteness and of the possible informational inefficiency of derivatives markets for calibration and pricing is also discussed.

## April 22, 2007

### Susan Athey, econometrician, wins Clark Medal

Susan Athey's applied econometrics and heterogeneity of mentorship wins a Clark Medal, awarded to the most accomplished economist under 40 and the most distinguished prize short of a Nobel. Bio.

## March 17, 2007

### NOAA vs the world

Paul K Greed asks why the anomaly figure is less worrisome than it seems:

1. All weather observers are safely far away from Iraq
2. It's still 72 F and sunny in La Jolla
3. America is still red, white and blue.

## February 15, 2007

### Prosper lending community

Prosper now enjoys some powerful tools:

* regression analysis forum -- adverse selection says avoid the high-rate borrowers
* Money Walks -- journal of patterns in data, performance
* Eric's great survey of lenders: outstandings, return
* Prosper Analytics -- animated charts, not quite chart junk. ROI.
* Prosper's own loan performance database

## February 6, 2007

### Infosthetics

Infosthetics shows time trends.

## January 24, 2007

### Many Eyes interactive data visualization

Data visualization in the web browser, with interaction. New champion: IBM's Many Eyes. Liked by JHeer and radar.oreilly.

## January 20, 2007

### Atrios's bookshelf

Atrios's bookshelf. A former economist, indeed.

## October 6, 2006

### Visualization and segmentation: Gelman's bag of tricks

Visualization and segmentation: Gelman's bag of tricks for teaching statistics.

## August 6, 2006

### Hedging beyond duration and convexity

Hedging beyond duration and convexity. By considering a representation using a Fourier-like harmonic series, empirical evidence comparing such a hedging strategy on a mortgage-backed security (MBS) with the first four principal components of the yield curve.

## July 10, 2006

### Haver data

Haver Analytics provides economic data, ready to use in Stata and eViews formats. PCE time series inflationary ?

## July 7, 2006

### Sparklines time series

Show the time series with a sparkline. Sparklines wiki. Go mad with stock charts. US Federal Budget deficit, 1983-2003.

## June 27, 2006

### Zivot on time series

Zivot's class notes in time series econometrics.

## May 25, 2006

### Unobserved Components Model, Proc UCM

The underlying model and several of the features of Proc UCM, new in the Econometrics and Time Series (ETS) module of SAS. Time series data is generated by marketers as they monitor "sales by month" and by medical researchers who collect vital sign information over time. This technique is well suited to modeling the effect of interventions (drug administration or a change in a marketing plan). The new procedure combines the flexibility of Proc ARIMA with the ease of use and interpretability of smoothing models. UCM does not yet have the capability to easily model transfer functions, a useful ARIMA feature that is planned for Proc UCM.
An Animated Guide©: Proc UCM (Unobserved Components Model), Russ Lavery, Contractor for ASG, Inc., PDF.

## May 24, 2006

### Econometric notes

Econometric course notes by John Aldrich.

## May 23, 2006

### SURSE

Seemingly unrelated regressions and simultaneous equations: PDF.

## May 22, 2006

### Statespace in SAS

Statespace in SAS/ETS. The STATESPACE procedure analyzes and forecasts multivariate time series using the state space model. It is appropriate for jointly forecasting several related time series that have dynamic interactions. By taking into account the autocorrelations among the whole set of variables, the STATESPACE procedure may give better forecasts than methods that model each series separately.

## May 15, 2006

### NumSum spreadsheets on the web

Spreadsheets put on the web by NumSum. Like Flickr for accountants.

## April 18, 2006

### Home value by room count, Miller Samuel

Home value by rooms, by Miller Samuel. This regression is crying out for a log transformation. And what do all the data points with fractional room counts represent ? *

## April 4, 2006

### Dashboard spy

Dashboard spy gallery of management dashboards and consoles full of KPI (key performance indicators). Update 2006 Dec.: moved to enterprise-dashboard.com. Ed Tufte adds,

## December 22, 2005

### Kimberly 'KC' Claffy

Kimberly 'KC' Claffy measures internet traffic.

## December 1, 2005

### log base 2

logbase2 is mostly biostatistics and visualization, with a blast of r. Bonus (detritus ?): and compliant lefty Canadian commentary.

## October 27, 2005

### Google News Report USA Score

Fetch headlines from Google News on a schedule, then rank headlines by factors:

* appearance day and time,
* prominence on the Google News page,
* number of appearances,
* others;

weighted to estimate the referer traffic these links bring to their source. Listed are the top scoring stories in recent time periods, followed by a ranking of sources. More detailed reports are linked-to at the bottom of each table. [*]

## October 15, 2005

### Joint regression analysis

Joint regression analysis to study genotype-environment interaction: genotype effects and/or interaction effects within individual environments are related to environmental effects. The interaction sum of squares is divided into two parts:

* one part represents the heterogeneity of linear regression coefficients, while
* the second represents the pooled deviations from individual regression lines.

R. J. (Bob) Baker

## September 8, 2005

### Hospital Length of Stay: Mean or Median Regression

Length of stay (LOS) is an important measure of hospital activity and health care utilization, but its empirical distribution is often positively skewed. Median regression appears to be a suitable alternative for analyzing clustered and positively skewed LOS, without transforming and trimming the data arbitrarily.

## August 29, 2005

### r graphics (Paul Murrell) is out

R Graphics by Paul Murrell shipped. Previously announced.

## August 24, 2005

### MCMC method bandwidth selection for multivariate kernel density estimation

Kernel density estimation for multivariate data is an important technique that has a wide range of applications in econometrics and finance. The lower level of its use is mainly due to the increased difficulty of deriving an optimal data-driven bandwidth as the dimension of the data increases. We provide Markov chain Monte Carlo (MCMC) algorithms for estimating optimal bandwidth matrices for multivariate kernel density estimation. Our approach is based on treating the elements of the bandwidth matrix as parameters whose posterior density can be obtained through the likelihood cross-validation criterion. Numerical studies for bivariate data show that the MCMC algorithm generally performs better than the plug-in algorithm under the Kullback-Leibler information criterion.
Numerical studies for five-dimensional data show that our algorithm is superior to the normal reference rule.

## August 23, 2005

### Curve Forecasting by Functional Autoregression

This paper explores prediction in time series in which the data is generated by a curve-valued autoregression process. It develops a novel technique, the predictive factor decomposition, for estimation of the autoregression operator, which is designed to be better suited for prediction purposes than the principal components method. The technique is based on finding a reduced-rank approximation to the autoregression operator that minimizes the norm of the expected prediction error. The new method is illustrated by an analysis of the dynamics of the Eurodollar futures rates term structure. We restrict the sample to the period of normal growth and find that in this subsample the predictive factor technique not only outperforms the principal components method but also performs on par with the best available prediction methods.

Curve Forecasting by Functional Autoregression. Presenter(s): Alexei Onatski, Columbia University. Co-Author(s): Vladislav Kargin, Cornerstone Research. Session Chair: James Stock, Harvard University.

## August 20, 2005

### Functional data analysis (FDA)

Functional data analysis (FDA) handles longitudinal data and treats each observation as a function of time (or another variable). The functions are related. The goal is to analyze a sample of functions instead of a sample of related points. FDA differs from traditional data analytic techniques in a number of ways. Functions can be evaluated at any point in their domain. Derivatives and integrals, which may provide better information (e.g. graphical) than the original data, are easily computed and used in multivariate and other functional analytic methods.

* S+Functional Data Analysis User's Guide, by Douglas B. Clarkson, Chris Fraley, Charles C. Gu, James O. Ramsay
* Functional Data Analysis (Springer Series in Statistics) (Hardcover), by J. Ramsay, B. W. Silverman. Covers topics of linear models, principal components, canonical correlation, and principal differential analysis in function spaces.
* Applied Functional Data Analysis (Paperback), by J.O. Ramsay, B.W. Silverman
* Bernard W. Silverman's code site: Applied Functional Data Analysis: Methods and Case Studies

## August 19, 2005

### Mathematical Statistics with MATHEMATICA

Mathematical Statistics with MATHEMATICA, Colin Rose, Murray D. Smith (Hardcover). The mathStatica software, an add-on to Mathematica, provides a toolset specially designed for doing mathematical statistics. It enables students to solve difficult problems by removing the technical calculations often associated with mathematical statistics. The professional statistician will be able to tackle tricky multivariate distributions, generating functions, inversion theorems, symbolic maximum likelihood estimation, unbiased estimation, and the checking and correcting of textbook formulas. This text would be a useful companion for researchers and students in statistics, econometrics, engineering, physics, psychometrics, economics, finance, biometrics, and the social sciences. Companion site: mathStatica.com

## August 4, 2005

### Information Visualisation with r

Information Visualisation lecture slides use r.

## July 21, 2005

### Asset prices by Enrico De Giorgi

Default models and asset pricing models at Enrico De Giorgi's resource, some with correlated defaults.

## July 19, 2005

### sas proc quantreg for quantile regression

Some PROC QUANTREG features:

* Implements the simplex, interior point, and smoothing algorithms for estimation
* Provides three methods to compute confidence intervals for the regression quantile parameter: sparsity, rank, and resampling
* Provides two methods to compute the covariance and correlation matrices of the estimated parameters: an asymptotic method and a bootstrap method
* Provides two tests for the regression parameter estimates: the Wald test and a likelihood ratio test
* Uses robust multivariate location and scale estimates for leverage point detection
* Multithreaded for parallel computing when multiple processors are available

[PDF, *]

## July 17, 2005

### SAS examples with explanation at ucla.edu/stat/SAS/

SAS examples with explanation abound at UCLA: 1, 2.

## July 10, 2005

### Array manipulation: Perl Data Language (PDL) and piddles

To COMPACTLY store and SPEEDILY manipulate the large N-dimensional data sets which are the bread and butter of scientific computing. e.g. $a=$b+$c can add two
2048x2048 images in only a fraction of a second.

Perl Data Language (PDL), PDL::Impatient - PDL for the impatient

A PDL scalar variable (an instance of a particular class of
perl object, i.e. blessed thingie) is a piddle.

## June 17, 2005

### state of stats

What have we learnt? State of stats: PDF, Antony Unwin on Statistical Learning.

* Global criteria: AIC, BIC, deviance, test error, ...
* Local criteria: residuals, diagnostics
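As a minimal illustration of the global criteria, AIC and BIC trade fit against model size: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the parameter count, n the sample size, and ln L the maximized log-likelihood. The log-likelihoods below are made up for the sake of the comparison.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*ln(L)."""
    return k * math.log(n) - 2 * log_lik

# Two hypothetical fitted models on n = 100 observations:
# model A has 3 parameters (log-likelihood -210),
# model B has 6 parameters (log-likelihood -205).
n = 100
print("A:", aic(-210, 3), bic(-210, 3, n))
print("B:", aic(-205, 6), bic(-205, 6, n))
```

Note how the two criteria can disagree: AIC prefers the larger model B, while BIC's heavier ln(n) penalty prefers the smaller model A.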

## June 16, 2005

### Support Vector Machine

An SVM corresponds to a linear method in a very high-dimensional feature
space which is nonlinearly related to the input space. It does not
involve any computations in that high-dimensional space: by the use of
kernels, all necessary computations are performed directly in input space.

SVMs are a method for creating functions from a set of labeled training
data. The function can be a classification function (the output is
binary: is the input in a category?) or a general
regression function.

For classification, SVMs operate by finding a hypersurface in the
space of possible inputs. This hypersurface will attempt to split the
positive examples from the negative examples. The split will be chosen
to have the largest distance from the hypersurface to the nearest of
the positive and negative examples. Intuitively, this makes the
classification correct for testing data that is near, but not
identical to the training data.
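The maximum-margin idea can be sketched for the linear case with a Pegasos-style stochastic sub-gradient descent on the regularized hinge loss. The toy data, learning-rate schedule, and regularization constant below are illustrative; a real SVM solver (such as the one wrapped by e1071) works with kernels and the dual problem.

```python
import random

def train_linear_svm(data, lam=0.1, epochs=300):
    """Toy primal SVM: stochastic sub-gradient descent on the regularized
    hinge loss (Pegasos-style). data is a list of (x, y) pairs with x a
    feature tuple and y in {-1, +1}. Returns weights w and bias b."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    rng = random.Random(0)
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):       # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [(1.0 - eta * lam) * wi for wi in w]   # shrink (regularize)
            if margin < 1:                             # inside margin: push out
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: positives near (2, 2), negatives near (-2, -2).
data = [((2, 2), 1), ((3, 1), 1), ((2, 3), 1),
        ((-2, -2), -1), ((-3, -1), -1), ((-2, -3), -1)]
w, b = train_linear_svm(data)
print([predict(w, b, x) for x, _ in data])
```

The shrink step pulls the weights toward zero every iteration, so only points that sit inside the margin keep the weights large: that is the "largest distance to the nearest examples" trade-off in miniature.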

r (with module e1071):
estimate, predict, example, example2.

Matlab:
Kernel Methods for Pattern Analysis
John Shawe-Taylor & Nello Cristianini
Cambridge University Press, 2004
Detailed contents, inventory of algorithms and kernels, and matlab code.

Stand-alone:

SVM Light is a Support Vector Machine.

## June 15, 2005

### Spectral Graph Transducer, SGTlight

SGTlight is an implementation of a Spectral Graph Transducer (SGT)
[Joachims, 2003] in C using Matlab libraries. The SGT is a method for
transductive learning. It solves a normalized-cut (or ratio-cut) problem
with additional constraints for the labeled examples using spectral
methods. The approach is efficient enough to handle datasets with
several tens of thousands of examples.

## June 14, 2005

### Analysis of patterns

Analysis of patterns

Automatic pattern analysis of data is a pillar of modern science,
technology and business, with deep roots in statistics, machine
learning, pattern recognition, theoretical computer science, and many
other fields. A unified conceptual understanding of this strategic
field is of utmost importance for researchers as well as for users of
this technology.

This workshop course will emphasize the common principles and roots
of modern pattern analysis technology, developed independently by many
different scientific communities over the past 30 years, and their
impact on modern science and technology.

Students and researchers from the many disciplines dealing with automatic
pattern analysis form the intended audience. These include (but are
not limited to) statistics, pattern recognition, data mining, machine
learning, information theory, sequence analysis, and bioinformatics.

Italy, October 28 - November 6, 2005

## June 13, 2005

### Data mining competition

Fair Isaac and UCSD data mining competition lets you test your predictive power.

## May 3, 2005

### Kalman filter with Mathematica

Kalman filter (An algorithm in control theory introduced by R. Kalman in 1960 and
refined by Kalman and R. Bucy. It is an algorithm which makes optimal use of imprecise
data on a linear (or nearly linear) system with Gaussian errors to continuously update
the best estimate of the system's current state.)
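The "continuously update the best estimate" step can be sketched for the scalar case: predict, compute the Kalman gain, then blend the prediction with the new measurement. The noise variances and sensor readings below are hypothetical.

```python
def kalman_1d(measurements, q=1e-4, r=0.04, x0=0.0, p0=1.0):
    """Minimal scalar Kalman filter for a constant hidden state observed
    with Gaussian noise. q = process variance, r = measurement variance,
    x0/p0 = initial state estimate and its variance."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # predict: state model is "constant", so only uncertainty grows
        p += q
        # update: gain weighs prediction confidence against sensor noise
        k = p / (p + r)
        x += k * (z - x)          # blend prediction with measurement
        p *= (1 - k)
        estimates.append(x)
    return estimates

# Noisy readings of a true value of 1.0 (hypothetical sensor data).
zs = [0.9, 1.1, 1.05, 0.95, 1.02, 0.98, 1.03, 0.97]
est = kalman_1d(zs)
print(est[-1])   # converges toward the true value 1.0
```

As the gain k shrinks over time, later measurements perturb the estimate less: the filter has become confident in its state.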

## April 29, 2005

### Decision Science News / Dan Goldstein

Decision Science News, by Dan Goldstein and Kevin Flora, covers the
decision sciences, including but not limited to psychology,
economics, business, medicine, and law, but
mostly marketing.

Also on Wilmott.

## April 28, 2005

### Statistical Modeling, Causal Inference / MLM

Statistical Modeling, Causal Inference, and Social Science (MLM)
Andrew Gelman and Samantha Cook at Columbia.

## April 27, 2005

### XLISP-Stat estimates Generalised Estimating Equations

XLISP-Stat tools for building Generalised Estimating Equation models
offers an introduction to GEE models.

Much of the brain trust of XLISP Stat has moved on to r.

## April 19, 2005

### r graphics, Paul Murrell

Update 2005 Sept 03: R Graphics is shipping !

A book on the core graphics facilities of the R language and
environment for statistical computing and graphics (to be published
by Chapman & Hall/CRC in August 2005). Preview now.

## March 25, 2005

### Wavelets

Wavelets are mathematical expansions that transform data from the
time domain into different layers of frequency levels. Compared to
standard Fourier analysis, they have the advantage [PDF] of being
localized both in time and in the frequency domain, and enable the
researcher to observe and analyze data at different scales.
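The time-and-frequency localization can be seen in a single step of the Haar transform, the simplest wavelet (the signal below is made up): the averages carry the coarse trend, while the detail coefficients are non-zero only where the signal actually changes.

```python
import math

def haar_step(signal):
    """One level of the Haar wavelet transform: scaled pairwise sums
    capture the coarse signal, scaled pairwise differences the local
    detail. Input length must be even."""
    s = 1 / math.sqrt(2)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [s * (a + b) for a, b in pairs]
    detail = [s * (a - b) for a, b in pairs]
    return approx, detail

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_step(x)
print(approx)  # coarse trend at half the resolution
print(detail)  # non-zero only where the signal changes in time
```

The scaling by 1/sqrt(2) makes the step energy-preserving, so it can be applied recursively to the averages to obtain the "different layers of frequency levels" mentioned above.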

## January 31, 2005

### Basel default

Probability of Default (PD)
- the probability that a specific customer will default
within the next 12 months.

Loss Given Default (LGD)
- the percentage of each credit facility that will be lost
if the customer defaults.

Exposure at Default (EAD)
- the expected exposure for each credit facility in the
event of a default.

## January 29, 2005

### How Ratings Agencies Achieve Rating Stability

Surveys on the use of agency credit ratings reveal that some
investors believe that rating agencies are relatively slow in
adjusting their ratings. A well-accepted explanation for this
perception on the timeliness of ratings is the "through-the-cycle"
methodology that agencies use. According to Moody's, through-the-cycle
ratings are stable because they are intended to measure the risk of
default over long investment horizons, and because they are
changed only when agencies are confident that observed changes in a
company's risk profile are likely to be permanent. To verify this
explanation, we quantify the impact of the long-term default horizon
and the prudent migration policy on rating stability from the
perspective of an investor with no desire for rating stability. This
is done by benchmarking agency ratings with a financial ratio-based
(credit scoring) agency-rating prediction model and (credit scoring)
default-prediction models of various time horizons. We also examine
rating migration practices. The final result is a better quantitative
understanding of the through-the-cycle methodology.

By varying the time horizon in the estimation of default-prediction
models, we search for a best match with the agency-rating prediction
model. Consistent with the agencies' stated objectives, we conclude
that agency ratings are focused on the long term. In contrast to
one-year default prediction models, agency ratings place less weight
on short-term indicators of credit quality.

We also demonstrate that the focus of agencies on long investment
horizons explains only part of the relative stability of agency
ratings. The other aspect of through-the-cycle rating methodology -
agency rating-migration policy - is an even more important factor
underlying the stability of agency ratings. We find that rating
migrations are triggered when the difference between the actual agency
rating and the model predicted rating exceeds a certain threshold
level. When rating migrations are triggered, agencies adjust their
ratings only partially, consistent with the known serial dependency of
agency rating migrations.
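The threshold-plus-partial-adjustment policy described above can be sketched directly. The numeric rating scale, threshold, and adjustment fraction below are illustrative choices, not the paper's estimates.

```python
def ttc_rating_path(model_ratings, initial=10, threshold=2, adjust=0.5):
    """Sketch of a through-the-cycle migration policy: the agency rating
    moves only when the gap to the model-implied rating exceeds a
    threshold, and then closes only a fraction of the gap (partial
    adjustment). Ratings are notches on a numeric scale."""
    agency = float(initial)
    path = [agency]
    for model in model_ratings:
        gap = model - agency
        if abs(gap) > threshold:
            agency += adjust * gap   # partial, not full, adjustment
        path.append(agency)
    return path

# Hypothetical model-implied ratings drifting downward, then recovering.
model = [10, 9, 8, 7, 6, 6, 7, 9, 11]
print(ttc_rating_path(model))
```

The agency path moves far less often than the model path, reproducing the stylized facts above: migrations are triggered only past the threshold, and each adjustment is partial, leaving the serial dependency of subsequent migrations.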

## January 23, 2005

### Treeage statistical software for non-statisticians

Features include sensitivity analysis and distribution graphs.

## January 22, 2005

### Belief Networks and Decision Networks

Belief networks (also known as Bayesian networks, Bayes networks and
causal probabilistic networks), provide a method to represent
relationships between propositions or variables, even if the
relationships involve uncertainty, unpredictability or imprecision.

They may be learned automatically from data files, created by an
expert, or developed by a combination of the two. They capture
knowledge in a modular form that can be transported from one situation
to another; it is a form people can understand, and which allows a
clear visualization of the relationships involved.

By adding decision variables (things that can be controlled), and
utility variables (things we want to optimize) to the relationships of
a belief network, a decision network (also known as an influence
diagram) is formed. This can be used to find optimal decisions,
control systems, or plans.
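A decision network mechanizes exactly this kind of expected-utility comparison over decision and chance variables. A hand-rolled miniature, with all probabilities and utilities hypothetical:

```python
# Decision node: take an umbrella or not. Chance node: rain, p = 0.3.
# Utility node: how good each (decision, weather) outcome is.
p_rain = 0.3
utility = {
    ("umbrella", "rain"): 70, ("umbrella", "dry"): 80,
    ("none", "rain"): 0,      ("none", "dry"): 100,
}

def expected_utility(decision):
    """Average the utility of a decision over the chance variable."""
    return (p_rain * utility[(decision, "rain")]
            + (1 - p_rain) * utility[(decision, "dry")])

best = max(["umbrella", "none"], key=expected_utility)
print(best, expected_utility(best))
```

A real influence diagram does this same maximization, but factored over a network of many chance and decision variables rather than a single table.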

## January 21, 2005

### Agena Risk bayesian network

Agena Risk bayesian network analysis software and whitepapers.

## January 14, 2005

### Bayesian Methods for Improving Credit Scoring Models

Abstract: We propose a Bayesian methodology that enables banks with
small datasets to improve their default probability estimates by
imposing prior information. As prior information, we use coefficients
from credit scoring models estimated on other datasets. Through
simulations, we explore the default prediction power of three Bayesian
estimators in three different scenarios and find that all three
perform better than standard maximum likelihood estimates. We
therefore recommend that banks consider Bayesian estimation for
internal and regulatory default prediction models.

Keywords: Credit Ratings, Rating Agency, Bayesian Inference, Basel II

JEL Classification: C11, G21, G33

## January 13, 2005

### ROC

The ability of a test to discriminate diseased cases from normal cases
is evaluated using Receiver Operating Characteristic (ROC) curve
analysis (Metz, 1978; Zweig & Campbell, 1993). ROC curves can also be
used to compare the diagnostic performance of two or more laboratory or
diagnostic tests (Griner et al., 1981).
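The discrimination measure usually reported from an ROC analysis, the area under the curve, has a simple rank interpretation that can be computed directly: the probability that a randomly chosen diseased case scores higher than a randomly chosen normal case. The scores and labels below are made up.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the fraction of (diseased, normal) pairs where the diseased case
    scores higher, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test scores; label 1 = diseased, 0 = normal.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    0,   1,   0]
print(roc_auc(scores, labels))  # 12 winning pairs out of 16 -> 0.75
```

An AUC of 0.5 means the test discriminates no better than chance; 1.0 means perfect separation of the two groups.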

## January 7, 2005

### TreeBoost - Stochastic Gradient Boosting

"Boosting" is a technique for improving the accuracy of a predictive
function by applying the function repeatedly in a series and combining
the output of each function with weighting so that the total error of
the prediction is minimized. In many cases, the predictive accuracy of
such a series greatly exceeds the accuracy of the base function used
alone.
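A minimal sketch of the idea for regression, boosting one-split "stumps" on the residuals with shrinkage. The data and settings are illustrative; TreeBoost itself is Friedman's stochastic gradient boosting over full regression trees.

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split minimizing squared error on residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.3):
    """Stagewise boosting for regression: each stump fits the current
    residuals, and its shrunken output is added to the ensemble."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Toy step-shaped target: the additive series of weak stumps recovers it.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0]
f = boost(xs, ys)
print([round(f(x), 2) for x in xs])
```

No single stump can represent the three-level target, but the weighted series of them does, which is the "accuracy greatly exceeds the base function" effect in miniature.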

## January 2, 2005

### Correlation Monger

Correlation monger provides pair-wise correlations of
demographic variables across the 50 US states.
## December 17, 2004

### MedCalc basic statistical features

MedCalc has a good list of basic statistical features:

* Stepwise multiple regression
* Stepwise logistic regression
* Paired and unpaired t-tests
* Rank sum tests: Wilcoxon test (paired data), Mann-Whitney U test (unpaired data)
* Variance ratio test (F-test)
* One-way analysis of variance (ANOVA) with Student-Newman-Keuls (SNK) test for pairwise comparison of subgroups
* Two-way analysis of variance
* Kruskal-Wallis test
* Frequencies table, crosstabulation analysis, Chi-square test, Chi-square test for trend
* Tests on 2x2 tables: Fisher's exact test, McNemar test
* Frequencies bar charts
* Kaplan-Meier survival curve, logrank test for comparison of survival curves, hazard ratio, logrank test for trend
* Cox proportional-hazards regression
* Meta-analysis: odds ratio (random effects or fixed effects model - Mantel-Haenszel method); summary effects for continuous outcomes; Forest plot
* Reference interval (normal range)
* Analysis of serial measurements with group comparison
* Bland & Altman plot for method comparison (bias plot) - repeatability

## December 9, 2004

### Combining trees with CART

Salford CART allows one to choose from several ways of combining
separate CART trees into a single predictive engine. The
trees are combined by either averaging their outputs for
regression or by using an unweighted plurality voting scheme
for classification. The current version of CART offers two
combination methods: Bootstrap aggregation and ARCing. Each
generates a set of trees by resampling (with replacement)
from the original training data.
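The bootstrap-aggregation half of that scheme can be sketched as follows, with a trivial threshold "stump" standing in for CART tree induction (the data and the weak learner are hypothetical):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """One bootstrap draw: resample the training data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner: threshold at the bootstrap sample's mean
    (a real bagged CART would grow a full tree on each sample)."""
    thr = sum(x for x, _ in sample) / len(sample)
    return lambda x, t=thr: "b" if x > t else "a"

def bagged_vote(classifiers, x):
    """Unweighted plurality vote over the ensemble."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

# Toy training set: class "a" clusters near x = 1, class "b" near x = 3.
data = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (3.0, "b"), (3.2, "b"), (2.9, "b")]
rng = random.Random(42)
ensemble = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print([bagged_vote(ensemble, x) for x in (0.9, 3.1)])
```

Each resample yields a slightly different classifier; the vote averages away their individual quirks, which is why combined trees usually beat a single tree.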

## December 7, 2004

### S-PLUS Predictive Modeling and Computational Finance

S-PLUS Predictive Modeling and Computational Finance
event with abstracts.

Nov 2004 Finance Event Proceedings for LossCalc II: Dynamic Prediction of LGD.
Greg Gupton, Moody's KMV

We describe LossCalc(tm) version 2.0, the Moody's KMV model to predict
loss given default (LGD). LGD is of natural interest to lenders and
investors wishing to estimate future credit losses. LossCalc is a
robust and validated model of LGD for loans and bonds globally.
LossCalc is a statistical model that incorporates information at all levels:
collateral, instrument, firm, industry, country, and the macroeconomy
to predict LGD. Also, what may be more interesting than merely
having a powerful predictive model is seeing and understanding the
underlying drivers of default recovery/loss that we show.

## November 25, 2004

### r project for statistical computing

The r project for statistical computing is an open source companion
to S, S-Plus, successor to XLispStat, and
more.

Whereas SAS and SPSS will give copious output from a regression
or discriminant analysis, R will give minimal output and store the
results in a fit object for subsequent interrogation by further R
functions.

R is an integrated suite of software facilities for data
manipulation, calculation and graphical display. Among
other things it has

* an effective data handling and storage facility,

* a suite of operators for calculations on arrays,
in particular matrices,

* a large, coherent, integrated collection of intermediate
tools for data analysis.

* graphical facilities for data analysis and display
either directly at the computer or on hardcopy.

* a well developed, simple and effective programming
language (called `S') which includes conditionals,
loops, user defined recursive functions and input
and output facilities. (Indeed most of the system
supplied functions are themselves written in the
S language.)