------------------------------------------------------------
"Using Linear Regression" from June, 1990
------------------------------------------------------------

Now that you have owned your Tandy 1000 for a little while,
you probably have a much better idea what a computer can and
cannot do. Our Tandy computers can do some amazing things;
hopefully yours is serving your needs adequately, while
amazing you at the same time. Computers remember everything
we tell them, and they can perform math calculations quickly
and accurately. However, they cannot predict the future!

Once in a while, you will see a story on TV or in the movies
involving computers. In some stories, the computer will make
a prediction involving the stock market, enabling our hero
to invest his money and make a bundle. In others, the
computer will pick the winning horse in the Kentucky Derby,
turning a ten-dollar bet into a suitcase full of money. On
another channel, you might see a detective use a computer to
predict which bank the robbers will attack next.

Although those scenarios are pretty far-fetched, there is a
grain of truth in them. No, you cannot just sit down at
your computer's keyboard, type in a question like "At what
level will the Dow Jones Index close on March 17", and
expect to see an answer. Your computer will respond with a
"Bad command or file name" message indicating that it has no
idea what you are talking about. But you can use a special
type of program, one that performs linear regression, to ask
the computer to examine historical data and to predict the
most likely value at some point in the future, based on the
trends exhibited by that data in the past.

Making Predictions
------------------

How about an example? Suppose you weigh 200 pounds and
decide to go on a diet. You stick to your diet and weigh
195 pounds next month, 190 pounds the month after that, and
185 pounds the month after that.
You do not need a computer to predict that if you stick to
your diet for another month, then you will probably weigh
180 pounds by the end of the next month. You probably
guessed the correct answer before you read it above; if so,
then you just performed a bit of linear regression in your
head.

Of course, using linear regression requires a bit of common
sense, too. You cannot expect to weigh 180 pounds next
month if you forget about the special diet and begin to eat
a gallon of ice cream every day. By changing your diet, you
have broken the special link that existed between time and
weight. And if you stretched the observed relationship
between time and weight to its limit, then you might expect
to disappear completely after a little more than three years
(40 months at five pounds a month), when your predicted
weight reaches zero pounds! Predictions made using linear
regression are often quite useful; however, they must be
critically reviewed by the person conducting the study to
ensure that the answer really "makes sense".

Linear regression studies are seldom as simple, as easy, or
as obvious as the diet study presented above. In fact, they
can be quite complicated. Fortunately, mathematicians have
worked out a set of equations which are ideally suited for
use in a computer program. Those equations can be used to
determine the link (if one exists) between any two sets of
observations or measurements, and then to use that
relationship to predict the value of one variable at
specified values of the other.

Some Examples
-------------

How might you use linear regression? Pretend that you
manage a Radio Shack franchise store. Your store has had
quite a turnover in the past year. In some months, you had
as few as three salesmen; in others, you had as many as a
dozen. And it should come as no surprise that sales
(measured in the number of Tandy 1000's sold) varied widely,
too. In one month, seven salesmen sold 99 units. In the
next month, twelve salesmen sold 152 units. With three
salesmen, you sold 81 units.
In the month with five salesmen on staff, you sold 98 units.
Last month, eleven salesmen sold 151 units, and this month,
eight salesmen sold 112 units. How many Tandy 1000's would
you sell if you held the staff constant at fifteen salesmen?
How many salesmen do you need to sell 200 Tandys each month?

Here is another problem: suppose you track the miles per
gallon in your car. In the weeks immediately following a
tune-up, you get 25 miles per gallon. That mileage slowly
deteriorates over time, and you have found that it becomes
economical to tune the car up again as soon as the mileage
drops below the 15 miles-per-gallon level. In January, your
mileage was 25 miles per gallon. In February, it was down
to 22. In March and April, it was 21 and 19 miles per
gallon, respectively. In what month will it reach 15 miles
per gallon?

What about this application? As the city planner for
Smalltown, USA, you are in charge of the city's long-range
budgeting for highway construction, school expansions, and
acquisition of drinking water resources. As such, it is
vital that you monitor the city's population growth so that
you can plan for future needs. The population of Smalltown
was only 3,217 in 1950. By 1960, it had grown to 19,100. In
1970, the population was 36,197. According to the sign at
the entrance to town, the population had grown to 52,914 by
1980. What might the population be in the year 1990? In
what year will the population exceed 80,000?

How would you tackle each of the problems presented above?
You can make an "educated guess" using a pencil and graph
paper, but that type of estimate is not really precise.
Those of you who are good with a calculator can probably
develop some reasonably good answers, but that kind of
approach can be rather tedious. Perhaps the best method is
the one found in this month's program listing. That program
uses the math techniques mentioned earlier to make each of
these three problems (and many others like them) a breeze.
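
For readers who would like to see the arithmetic laid out,
the three problems above can be sketched in a few lines of a
modern language. What follows is a minimal sketch in Python,
not the BASIC listing from the magazine. The function names
fit_line, predict_y, and predict_x are made up for this
illustration, and the formulas are the standard
least-squares ones, assumed here rather than taken from the
listing. Months and years are entered as plain numbers.

```python
# Minimal least-squares sketch (illustrative only; this is NOT the
# magazine's BASIC listing). fit_line, predict_y and predict_x are
# hypothetical names chosen for this example.

def fit_line(xs, ys):
    """Return (slope, intercept) of the line y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching pairs of observations")
    sx = sum(xs)                              # sum of the X values
    sy = sum(ys)                              # sum of the Y values
    sx2 = sum(x * x for x in xs)              # sum of the X values squared
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of X times Y
    denom = n * sx2 - sx * sx
    if denom == 0:
        raise ValueError("all X values are identical; no line can be fitted")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def predict_y(line, x):
    """Predict Y for a given X."""
    slope, intercept = line
    return intercept + slope * x


def predict_x(line, y):
    """Predict X for a given Y by inverting the same line (slope must not be 0)."""
    slope, intercept = line
    return (y - intercept) / slope


# The three sample problems from the text.
store = fit_line([7, 12, 3, 5, 11, 8], [99, 152, 81, 98, 151, 112])
car = fit_line([1, 2, 3, 4], [25, 22, 21, 19])   # 1 = January ... 4 = April
town = fit_line([1950, 1960, 1970, 1980], [3217, 19100, 36197, 52914])

print("Computers sold by 15 salesmen:", round(predict_y(store, 15), 2))
print("Salesmen needed for 200 sales:", round(predict_x(store, 200), 2))
print("Month when mileage hits 15:   ", round(predict_x(car, 15), 2))
print("Smalltown population in 1990: ", round(predict_y(town, 1990)))
print("Year population hits 80,000:  ", round(predict_x(town, 80000), 1))
```

Run as written, the sketch should print values close to
176.56, 17.82, 6.05, 69404, and 1996.4, in line with the
figures quoted elsewhere in this article.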

Do It Yourself
--------------

Let's assume the program is now error-free and ready to
test. We will use the first example above (the Radio Shack
store problem) to test the new program. If you just typed
the program in, then it is still "in memory"; you can just
type RUN and press ENTER to begin. If you entered the
program during an earlier session, you must first enter the
command BASIC from the A> or B> or C> prompt and press
ENTER. Once you reach the old familiar "Ok" prompt, type
RUN "REGRESS" and press ENTER.

In only a few moments, the screen will clear, the title
"LINEAR REGRESSION PROGRAM" will appear, and a short
paragraph describing the program will be presented. There
will also be a question on screen asking for "the name of
the first variable or group of measurements".

Re-read the example and you will see that there are two
items involved in this study -- the number of salesmen
employed in any month, and the number of computers that they
sold. Answer the question by providing a name for the first
group; enter "Number of Salesmen", or "# Salesmen", or just
"Salesmen" and press ENTER. When you are asked for "the name
of the other variable or group of measurements", enter
"Number of Computers Sold" or something similar, then press
ENTER.

In a moment, the program will ask you for "a measured or
observed value for the Number of Salesmen". Remember, in
one month, seven salesmen sold 99 computers. Enter the
number seven and press ENTER. When you are asked for the
"corresponding value of Number of Computers Sold", you can
enter 99 and press ENTER. When prompted for another pair of
measurements, enter the number 12 for the "number of
salesmen" and 152 for the "number of computers sold".

After two entries, you will be asked if "you want to enter
another pair of points (Y/N)?" If you have more information
to enter (and we do!), press the letter "Y" for "yes".
Enter the other pairs of data -- three salesmen and 81
computers sold, five salesmen and 98 computers sold, eleven
salesmen and 151 computers sold, and eight salesmen and 112
computers sold. When there are no more salesmen vs. sales
data, you must enter "N" for "no" at the yes-or-no question.
If you make an incorrect entry and catch it before you press
ENTER, you can use the BACK SPACE key to make a correction.
However, once you press ENTER, there is no way to go back
and make a change. In that event, press CTRL-BREAK to halt
execution, then type RUN and start over from the beginning.

Program Options
---------------

Once all of the data has been entered, you should see a set
of seven program options. You can press "1" to review the
data entered thus far. If you need to add one or more
pairs of data, you can press "2". You can press "3" to
have the program calculate and display an interesting
quantity referred to as the correlation coefficient. What
does it mean? The correlation coefficient tells you how
closely the data you have entered follow a straight line,
and therefore how much confidence you can place in the
computer's predictions. If that coefficient is 100%, then
every pair you entered lies exactly on a straight line, and
the predictions are as trustworthy as the data themselves.
If the coefficient is 0%, then there is no straight-line
relationship at all, and the predictions will almost
certainly be wrong.

Of course, in "real life" the correlation coefficient is
usually somewhere in between 0 and 100%. From my
experience, any coefficient greater than 70 or 80% is
significant; predictions made using that regression are
probably fairly reliable. If the coefficient is less than
70%, then the results should be viewed with caution. The
coefficient for the computer salesmen problem is about 94%,
which signifies a very good correlation and a high degree of
confidence.

Option "4" on the main menu allows you to "Predict a Value
for Number of Computers Sold Given the Number of Salesmen".
Choose that option, then when asked to enter a value for the
number of salesmen, enter 15 and press ENTER.
According to the program, fifteen salesmen would sell
176.5562 computers each month. Does that mean that if you
hired fifteen salesmen you would sell exactly 176.5
computers each month? Of course not, although 176.5 is the
computer's best estimate using the information you provided.
Some months you might sell more, some months less. But in
the long run, you would be expected to sell an average of
176.5 computers each month.

Other Choices
-------------

Option "5" on the main menu allows you to "Predict a Value
for Number of Salesmen Given the Number of Computers Sold".
Choose that option, then when asked to enter a value for the
number of computers sold, enter 200 and press ENTER. Once
again, the computer's analysis of the situation reveals that
in order to sell 200 computers per month, you must hire
exactly 17.81579 salesmen. Since it is hard to hire a
fraction of a salesman, let's round that up to 18 people.
Does that guarantee that you will sell exactly 200 computers
each month? Not necessarily! Some of the new hires may be
goof-offs, others may be go-getters. But (and you may have
heard this before), 18 salesmen is the computer's best
estimate of the number needed to sell 200 computers using
the information you provided.

Option "6" allows you to begin a new study, while option "7"
exits the program and returns you to BASIC's "Ok" prompt. If
you do select option "7", you can do several things once you
reach the "Ok" prompt. You can type RUN and press ENTER to
execute the program again. You can type LIST to look at the
program's statements again. You can type RUN "ANOTHER" to
load and execute another program. Or you can type SYSTEM to
return to DOS.

How It Works
------------

Let's look at the program listing more closely to see how
this "magic" is accomplished. Lines 140 through 190 clear
the screen, select the screen colors, and present some short
introductory text. Notice that the DIM statement at the end
of line 140 defines two arrays which will be used to hold
the information you enter.
If you plan to enter a set of data with more than 20 pairs
of observations, you must edit line 140 to change both "20"
values to a larger number.

Lines 230 through 260 accept names for each of the two
related sets of numbers you will be entering. Those names
are stored in string variables XNAME$ and YNAME$. Lines 300
through 330 are used to accept values for a pair of entries.
They are stored in the X and Y arrays dimensioned back in
line 140. Counter variable N keeps track of how many pairs
you have entered. Notice in line 340 that if you have
entered only one pair of points, you are immediately sent
back to line 300 to enter a second pair. It is impossible
to make any predictions with only one pair of observations.

Lines 350 through 370 check to see whether you have more
pairs of observations to enter. Line 360 uses the INKEY$
statement so that you need not press ENTER. Lines 360 and
370 check for both upper and lower case letter responses.
The table of program options is presented in lines 410
through 480. The user's response is captured in line 490,
and depending on that response, the program jumps to the
proper location from line 500.

Menu Options
------------

Lines 540 through 570 print a table which allows the user to
review all of the information he or she has entered. Lines
580 and 590 wait for some type of keystroke before returning
to the main menu in line 410.

Lines 630 and 640 present the value for the correlation
coefficient. That is accomplished by calling subroutine
820. That subroutine performs a number of calculations, one
of which is the determination of R2. R2 is a form of the
correlation coefficient; its value lies between negative one
and positive one. Line 640 takes its absolute value then
multiplies by 100 to transform it into a number between 0
and 100.

Lines 680 through 700 are used to predict a Y value given an
X value. The call to subroutine 820 in line 680 calculates
the values SLOPE and INTER needed to make the prediction.
Lines 680 and 690 accept a new X value.
The predicted value of Y is calculated in the second half of
line 690, and it is displayed in line 700. Lines 710
through 730 are very similar, except that they use a
specified Y value to predict a value for X.

Line 770 is executed when the user decides to begin a new
study. Line 780 is executed when the user decides to exit
the program; the colors are reset to the default values, the
screen is cleared, and the program ends.

As mentioned before, subroutine 820 performs the key
calculations that are necessary to make predictions. Lines
820 through 850 calculate a number of summations -- the sum
of all the X values, all of the Y values, all of the X
values squared, all of the Y values squared, and all of the
products of X times Y. Those quantities -- in variables SX,
SY, SX2, SY2, and SXY, respectively -- are used in lines 860
through 890 to calculate variables SLOPE, INTER, STDVX, and
STDVY. Line 900 calculates the correlation coefficient R2.

Wrapping Up
-----------

As a final exercise, why don't you try the other sample
problems presented above? The car mileage problem
illustrates an important point. All of the data you enter
must be in numerical format. Thus, you cannot enter the
months using the words "January", "February", and "March";
you must enter them as numbers -- 1 for January, 2 for
February, and 3 for March. Just for the record, here are
the answers I found: it will be time to tune the car again
in June. The population of Smalltown will be 69,404 in
1990, and it will reach 80,000 sometime in the first half of
1996.

I hope you find this month's program interesting and
useful. If you have any questions or comments, write to me
in care of ONE THOUSAND magazine. See you next month!
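
A postscript for the curious: the calculations attributed to
subroutine 820 under "How It Works" (five sums, then SLOPE,
INTER, STDVX, STDVY, and the correlation) might look
something like the Python sketch below. It is not the BASIC
listing. The variable names echo the ones mentioned in the
article, but the function name regression_stats and the
formulas themselves (standard least squares, with sample
standard deviations) are assumptions.

```python
import math

# Illustrative sketch only; NOT the magazine's subroutine 820.
# regression_stats is a hypothetical name, and the formulas are the
# textbook ones, assumed rather than copied from the listing.

def regression_stats(xs, ys):
    """Return the sums, line, spreads and correlation for paired data."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching pairs of observations")

    # The five summations described in the article.
    sx = sum(xs)
    sy = sum(ys)
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # Least-squares line: y = inter + slope * x.
    slope = (n * sxy - sx * sy) / (n * sx2 - sx * sx)
    inter = (sy - slope * sx) / n

    # Sample standard deviations of X and Y.
    stdvx = math.sqrt((sx2 - sx * sx / n) / (n - 1))
    stdvy = math.sqrt((sy2 - sy * sy / n) / (n - 1))
    if stdvy == 0:
        raise ValueError("all Y values are identical; correlation is undefined")

    # Correlation coefficient, between -1 and +1.
    r = slope * stdvx / stdvy

    return {"slope": slope, "inter": inter,
            "stdvx": stdvx, "stdvy": stdvy, "r": r}


stats = regression_stats([7, 12, 3, 5, 11, 8], [99, 152, 81, 98, 151, 112])
print("slope:", round(stats["slope"], 4), " intercept:", round(stats["inter"], 4))
print("r:", round(stats["r"], 4), " r squared:", round(stats["r"] ** 2, 4))
```

For the salesmen data, r comes out near 0.97 and its square
near 0.94. The squared value is the one that lines up with
the "about 94%" quoted earlier; whether the listing arrives
at its figure in exactly this way is a guess.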