
Sep 28, 2006, 11:49:25 AM

I recently ran into this statement:

"Some roadway and traffic variables have a clear effect on the lateral position of the motorist during both passing and non-passing events. However, several variables are only statistically significant in one of these cases. One regression model may include a variable that is statistically insignificant if it is significant in the other in order to maintain the same independent variables among the two events and to ultimately generate a measure for the change in lateral position of the motorist."

I am not a statistician, but this struck me as a bit strange. Can anyone comment on the idea of keeping a non-significant variable in a model in order to match another model?
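For concreteness, the setup the study describes (fitting the same predictor set to both event types so that the two coefficient vectors can be differenced) can be sketched with simulated data; every variable name and effect size below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    """OLS coefficients with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

# Hypothetical roadway variables: lane width, shoulder width, traffic volume.
n = 200
X = rng.normal(size=(n, 3))

# Simulated lateral positions; the second predictor matters only while passing.
y_passing = 1.0 + 0.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.3, size=n)
y_nonpass = 0.6 + 0.5 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Because both models keep the SAME independent variables (significant or
# not), the coefficient vectors line up term by term and can be differenced
# to estimate the change in lateral position between the two event types.
delta = ols(X, y_passing) - ols(X, y_nonpass)
print(delta)  # delta[0] is the intercept shift; delta[1:] are per-predictor changes
```

Dropping a variable from only one of the two models would make this term-by-term subtraction meaningless, which appears to be the study's rationale.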

Sep 28, 2006, 6:15:38 PM

On 28 Sep 2006 08:49:25 -0700, "John Kane" <jrkr...@gmail.com>

wrote:

What makes sense is to keep variables in models because you

expect them to be meaningful.

The too-frequent, common mistake is to drop variables from an

equation merely because they fail to be "statistically significant"

in a particular case.

The question of stepwise regression has comments from years

ago collected in my stats-FAQ.

--

Rich Ulrich, wpi...@pitt.edu

http://www.pitt.edu/~wpilib/index.html

Sep 28, 2006, 9:04:36 PM

Richard Ulrich wrote:

> On 28 Sep 2006 08:49:25 -0700, "John Kane" <jrkr...@gmail.com>

> wrote:

>

> > I recently ran into this statement

< not very informative nor necessary statement snipped to answer

the stated question>

> >

> > I am not a statistician but this struck me as a bit strange. Can anyone

> > comment on the idea of keeping a non-significant variable in a model in

> > order to match another model?

Variables that are not "statistically significant" are kept NOT for

the reason of matching anything in most of the common usage.

Statistically REDUNDANT (superfluous, unnecessary) variables

are dropped because they not only add nothing to the model

but they may in fact make the model worse, much worse, in

terms of precision and stability.
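A numerical illustration of that precision point: adding a near-duplicate of an existing predictor can balloon the standard error of its coefficient. A minimal sketch with simulated data (the collinearity level chosen here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1: redundant
y = 2.0 * x1 + rng.normal(size=n)

def coef_se(X, y):
    """Standard errors of the OLS coefficients (intercept included)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (len(y) - Xd.shape[1])
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))

se_lean = coef_se(x1.reshape(-1, 1), y)          # x1 alone
se_full = coef_se(np.column_stack([x1, x2]), y)  # x1 plus its near-copy
print(se_lean[1], se_full[1])  # the x1 slope's SE is far larger in the full model
```

The redundant x2 adds essentially no information, yet the coefficient on x1 becomes much less precisely estimated, which is exactly the instability Bob describes.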

Once you get away from those redundant variable cases, the simplest answer to WHY you keep statistically non-significant variables is that for many problems, while they are not statistically significant, they are much better than nothing. :-)

If you drop variables because they are statistically NOT significant, then you may find, especially for sociological data, that you often end up with NO VARIABLE in a regression equation because you have dropped everything. :)

>

> What makes sense is to keep variables in models because you

> expect them to be meaningful.

That may be true sometimes, but often NOT true for seeking

only FITTING or PREDICTION models.

The "meaningful" idea is one of the common abuses by

social scientists in their misapplication of regression methods.

Variables do not have their unique meanings. In a multiple

regression, the meaning of a variable is its effect IN THE

PRESENCE OF ALL OTHER VARIABLES in the equation.

Therefore, the same variable may have thousands of

different meanings, all depending on which are the OTHER

variables in the equation.
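That context-dependence is easy to demonstrate. In the simulated example below (all numbers invented for illustration), the same variable gets a clearly negative coefficient when fitted alone and a coefficient near +1 when fitted alongside a correlated covariate:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = -0.8 * x1 + rng.normal(scale=0.6, size=n)  # x2 negatively correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

def slopes(X, y):
    """OLS slope estimates (an intercept is fitted but not returned)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta[1:]

alone = slopes(x1.reshape(-1, 1), y)              # x1 by itself
together = slopes(np.column_stack([x1, x2]), y)   # x1 in the presence of x2
print(alone[0], together[0])
# Same data, same variable: the sign of x1's coefficient flips with context.
```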

> The too-frequent, common mistake is to drop variables from an

> equation merely because they fail to be "statistically significant"

> in a particular case.

That statement is clearly UNTRUE.

The dropping of variables is when the variables are "statistically

redundant" (or unnecessary) in the presence of other variables

already in the regression model.

> The question of stepwise regression has comments from years

> ago collected in my stats-FAQ.

Most comments I've seen are irrelevant, impertinent, or technically flawed.

The problem in question, and the approach to the solution of the problem, have very little, if anything, to do with stepwise regressions.

-- Reef Fish Bob.

Sep 29, 2006, 11:07:36 AM

Reef Fish wrote:

> Richard Ulrich wrote:

> > On 28 Sep 2006 08:49:25 -0700, "John Kane" <jrkr...@gmail.com>

> > wrote:

> >

> > > I recently ran into this statement

>

> < not very informative nor necessary statement snipped to answer

> the stated question>

> > >

> > > I am not a statistician but this struck me as a bit strange. Can anyone

> > > comment on the idea of keeping a non-significant variable in a model in

> > > order to match another model?

>

> Variables that are not "statistically significant" are kept NOT for

> the reason of matching anything in most of the common usage.

>

> Statistically REDUNDANT (superfluous, unnecessary) variables

> are dropped because they not only add nothing to the model

> but they may in fact make the model worse, much worse, in

> terms of precision and stability.

>

> Once you get away from those redundant variable cases, the
> simplest answer to WHY you keep statistically non-significant
> variables is that for many problems, while they are not
> statistically significant, they are much better than nothing. :-)
>
> If you drop variables because they are statistically NOT
> significant, then you may find, especially for sociological
> data, that you often end up with NO VARIABLE in a
> regression equation because you have dropped everything. :)

>

> >

> > What makes sense is to keep variables in models because you

> > expect them to be meaningful.

But in the context (that Bob clipped) the intent seems to be to make the model somehow comparable to another model. This does not seem to make sense. I can see keeping the variables if you expect them to be useful when examining another data set, particularly if there is a theoretical reason.

>

> That may be true sometimes, but often NOT true for seeking

> only FITTING or PREDICTION models.

That was my thought and this was clearly an engineering study intended

for this purpose.

>

> The "meaningful" idea is one of the common abuses by

> social scientists in their misapplication of regression methods.

And those pesky traffic engineers it appears :)

>

> Variables do not have their unique meanings. In a multiple

> regression, the meaning of a variable is its effect IN THE

> PRESENCE OF ALL OTHER VARIABLES in the equation.

>

> Therefore, the same variable may have thousands of

> different meanings, all depending on which are the OTHER

> variables in the equation.

>

>

> > The too-frequent, common mistake is to drop variables from an

> > equation merely because they fail to be "statistically significant"

> > in a particular case.

>

> That statement is clearly UNTRUE.

>

> The dropping of variables is when the variables are "statistically

> redundant" (or unnecessary) in the presence of other variables

> already in the regression model.

My problem is that I cannot see what gain there is in retaining the variables just to make a comparison against another model. Somehow I see it as soaking up a bit of variance that might be better explained by the other variables.

If nothing else, leaving a redundant variable in the regression seems to me to be irresponsible, given that the target audience is not likely to be researchers but either practicing traffic/civil engineers or policy makers who may not understand the "significance" of an insignificant variable in a model.

>

>

> > The question of stepwise regression has comments from years

> > ago collected in my stats-FAQ.

>

> Most comments I've seen are irrelevant, impertinent, or
> technically flawed.
>
> The problem in question, and the approach to the solution
> of the problem, have very little, if anything, to do with
> stepwise regressions.

>

> -- Reef Fish Bob.

Thanks to both of you for the comments. They have been helpful.

John Kane, Kingston ON Canada

Sep 29, 2006, 12:08:56 PM

Hello John ...

Oftentimes one is interested in testing the hypothesis that the coefficients (collectively) are homogeneous across groups, leading to the Gregory Chow test (Princeton University).
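For readers who haven't met it, the Chow test compares a pooled fit against separate per-group fits of the same specification. A minimal sketch with simulated data (not any particular package's implementation; the group sizes and effect sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(3)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

def chow_stat(X1, y1, X2, y2):
    """Chow F statistic: are the coefficients equal across the two groups?"""
    k = X1.shape[1] + 1                        # parameters incl. intercept
    s_pooled = ssr(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    s_sep = ssr(X1, y1) + ssr(X2, y2)          # groups fitted separately
    n = len(y1) + len(y2)
    return ((s_pooled - s_sep) / k) / (s_sep / (n - 2 * k))

# Hypothetical groups: same predictors, but the second slope differs.
X1 = rng.normal(size=(80, 2))
y1 = 1.0 + X1 @ np.array([0.5, 1.0]) + rng.normal(scale=0.5, size=80)
X2 = rng.normal(size=(80, 2))
y2 = 1.0 + X2 @ np.array([0.5, 2.0]) + rng.normal(scale=0.5, size=80)
print(chow_stat(X1, y1, X2, y2))  # a large F value: the coefficients differ
```

Under the null of equal coefficients the statistic follows an F(k, n - 2k) distribution; note the test presupposes the same variable set in both groups, which connects to the study's choice to keep matching predictors.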

A similar problem in time series is to test for break points in parameters, i.e., whether there is a point in time at which the coefficients for an ARIMA process change significantly.

We have implemented that test in order to test the idea of non-transient structure, which leads directly to segmenting the time series at the identified break point(s).

Regards

Dave Reilly

http://www.autobox.com

Sep 29, 2006, 12:25:42 PM

Thanks Dave.

I see what you mean there and that makes sense. However, the researchers seem to have some idea of comparing two models, developed on the same data set but, if my cursory reading is correct, predicting different driver behaviour, and apparently left the redundant variables in to 'facilitate' comparisons.

The study was a very applied one, apparently intended to provide input

to government policy on road design.

Maybe I am suspicious of the faux-3D spreadsheet barplots they used :)

They also seemed to be using stepwise regression to establish the

models, which struck me as a bit dubious.
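For anyone unfamiliar with the procedure being doubted here: forward stepwise selection greedily adds whichever candidate variable most improves the fit, stopping when the improvement is no longer "significant". A toy sketch (the F-to-enter threshold of 4.0 is conventional but arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

def ssr(Xd, y):
    """Sum of squared residuals of an OLS fit to design matrix Xd."""
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

def forward_stepwise(X, y, f_to_enter=4.0):
    """Greedy forward selection: repeatedly add the column that most
    reduces the SSR, stopping when its F-to-enter drops below threshold."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    design = np.ones((n, 1))       # start with the intercept only
    s_cur = ssr(design, y)
    while remaining:
        s_new, best = min((ssr(np.column_stack([design, X[:, j]]), y), j)
                          for j in remaining)
        df = n - design.shape[1] - 1
        if (s_cur - s_new) / (s_new / df) < f_to_enter:
            break                  # no remaining variable is "significant"
        selected.append(best)
        remaining.remove(best)
        design = np.column_stack([design, X[:, best]])
        s_cur = s_new
    return selected

# Three candidate predictors; only the first two actually matter.
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=300)
print(forward_stepwise(X, y))
```

The trouble, as the thread suggests, is that the selected set depends on noise and on the order of entry, so the "final" model and its significance levels are easy to over-interpret.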
