Principles of Workflow in Data Analysis

Σχετικά έγγραφα
Principles of Workflow in Data Analysis

ΤΕΛΙΚΗ ΕΞΕΤΑΣΗ ΔΕΙΓΜΑ ΟΙΚΟΝΟΜΕΤΡΙΑ Ι (3ο Εξάμηνο) Όνομα εξεταζόμενου: Α.Α. Οικονομικό Πανεπιστήμιο Αθήνας -- Τμήμα ΔΕΟΣ Καθηγητής: Γιάννης Μπίλιας

MATHACHij = γ00 + u0j + rij

!"!"!!#" $ "# % #" & #" '##' #!( #")*(+&#!', & - #% '##' #( &2(!%#(345#" 6##7

ΟΙΚΟΝΟΜΕΤΡΙΑ 2 ΦΡΟΝΤΙΣΤΗΡΙΟ 2 BASICS OF IV ESTIMATION USING STATA

Π.Μ.Σ. ΒΙΟΣΤΑΤΙΣΤΙΚΗ ΑΝΑΛΥΣΗ ΔΙΑΣΠΟΡΑΣ ΚΑΙ ΠΑΛΙΝΔΡΟΜΗΣΗΣ ΤΕΛΙΚΟ ΔΙΑΓΩΝΙΣΜΑ 27/6/2016

2. ΧΡΗΣΗ ΣΤΑΤΙΣΤΙΚΩΝ ΠΑΚΕΤΩΝ ΣΤΗ ΓΡΑΜΜΙΚΗ ΠΑΛΙΝΔΡΟΜΗΣΗ

Comparing groups in regression models for binary outcomes

Προσομοίωση BP με το Bizagi Modeler

Εγκατάσταση λογισμικού και αναβάθμιση συσκευής Device software installation and software upgrade

The challenges of non-stable predicates

Bizagi Modeler: Συνοπτικός Οδηγός

LAMPIRAN. Fixed-effects (within) regression Number of obs = 364 Group variable (i): kode Number of groups = 26

Πανεπιστήμιο Πειραιώς Τμήμα Πληροφορικής Πρόγραμμα Μεταπτυχιακών Σπουδών «Πληροφορική»

ΓΕΩΠΟΝΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙO ΑΘΗΝΩΝ ΤΜΗΜΑ ΑΞΙΟΠΟΙΗΣΗΣ ΦΥΣΙΚΩΝ ΠΟΡΩΝ & ΓΕΩΡΓΙΚΗΣ ΜΗΧΑΝΙΚΗΣ

ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΙΑΣ ΤΜΗΜΑ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ ΤΟΜΕΑΣ ΥΔΡΑΥΛΙΚΗΣ ΚΑΙ ΠΕΡΙΒΑΛΛΟΝΤΙΚΗΣ ΤΕΧΝΙΚΗΣ. Ειδική διάλεξη 2: Εισαγωγή στον κώδικα της εργασίας

Partial Trace and Partial Transpose

Approximation of distance between locations on earth given by latitude and longitude

Τμήμα Πολιτικών και Δομικών Έργων

ΑΡΙΣΤΟΤΕΛΕΙΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΟΝΙΚΗΣ

The Simply Typed Lambda Calculus

Numerical Analysis FMN011

ΣΧΕΔΙΑΣΜΟΣ ΕΠΙΓΕΙΟΥ ΣΥΣΤΗΜΑΤΟΣ ΑΛΥΣΟΚΙΝΗΣΗΣ ΓΙΑ ΜΕΤΑΦΟΡΑ ΤΡΟΛΕΪ

Reminders: linear functions

Durbin-Levinson recursive method

SPEEDO AQUABEAT. Specially Designed for Aquatic Athletes and Active People

5.4 The Poisson Distribution.

SCHOOL OF MATHEMATICAL SCIENCES G11LMA Linear Mathematics Examination Solutions

Elements of Information Theory

Notes on the Open Economy

Démographie spatiale/spatial Demography

ΤΕΧΝΙΚΕΣ ΔΙΑΓΝΩΣΗΣ ΤΗΣ ΝΟΣΟΥ ΑΛΤΣΧΑΙΜΕΡ ΜΕ FMRI

Other Test Constructions: Likelihood Ratio & Bayes Tests

ΤΟ ΣΤΑΥΡΟΔΡΟΜΙ ΤΟΥ ΝΟΤΟΥ ΤΟ ΛΙΜΑΝΙ ΤΗΣ ΚΑΛΑΜΑΤΑΣ

ΠΑΝΔΠΙΣΗΜΙΟ ΜΑΚΔΓΟΝΙΑ ΠΡΟΓΡΑΜΜΑ ΜΔΣΑΠΣΤΥΙΑΚΧΝ ΠΟΤΓΧΝ ΣΜΗΜΑΣΟ ΔΦΑΡΜΟΜΔΝΗ ΠΛΗΡΟΦΟΡΙΚΗ

1 (forward modeling) 2 (data-driven modeling) e- Quest EnergyPlus DeST 1.1. {X t } ARMA. S.Sp. Pappas [4]

ΕΘΝΙΚΗ ΣΧΟΛΗ ΔΗΜΟΣΙΑΣ ΔΙΟΙΚΗΣΗΣ ΙΓ' ΕΚΠΑΙΔΕΥΤΙΚΗ ΣΕΙΡΑ

Test Data Management in Practice

CHAPTER 25 SOLVING EQUATIONS BY ITERATIVE METHODS

ΟΡΓΑΝΙΣΜΟΣ ΒΙΟΜΗΧΑΝΙΚΗΣ ΙΔΙΟΚΤΗΣΙΑΣ

Advanced Subsidiary Unit 1: Understanding and Written Response

ΑΓΓΛΙΚΑ Ι. Ενότητα 7α: Impact of the Internet on Economic Education. Ζωή Κανταρίδου Τμήμα Εφαρμοσμένης Πληροφορικής

LESSON 14 (ΜΑΘΗΜΑ ΔΕΚΑΤΕΣΣΕΡΑ) REF : 202/057/34-ADV. 18 February 2014

ΤΕΧΝΟΛΟΓΙΚΟ ΕΚΠΑΙ ΕΥΤΙΚΟ Ι ΡΥΜΑ ΚΡΗΤΗΣ ΣΧΟΛΗ ΙΟΙΚΗΣΗΣ ΚΑΙ ΟΙΚΟΝΟΜΙΑΣ ΤΜΗΜΑ ΙΟΙΚΗΣΗΣ ΕΠΙΧΕΙΡΗΣΕΩΝ ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 24/3/2007

6.1. Dirac Equation. Hamiltonian. Dirac Eq.

6.3 Forecasting ARMA processes

Τ.Ε.Ι. ΔΥΤΙΚΗΣ ΜΑΚΕΔΟΝΙΑΣ ΠΑΡΑΡΤΗΜΑ ΚΑΣΤΟΡΙΑΣ ΤΜΗΜΑ ΔΗΜΟΣΙΩΝ ΣΧΕΣΕΩΝ & ΕΠΙΚΟΙΝΩΝΙΑΣ

Assalamu `alaikum wr. wb.

ΣΤΥΛΙΑΝΟΥ ΣΟΦΙΑ

ΧΑΡΟΚΟΠΕΙΟ ΠΑΝΕΠΙΣΤΗΜΙΟ

Πρόβλημα 1: Αναζήτηση Ελάχιστης/Μέγιστης Τιμής

Math 6 SL Probability Distributions Practice Test Mark Scheme

Keystroke-Level Model

Εισαγωγή στη Βιοπληροφορική

Nowhere-zero flows Let be a digraph, Abelian group. A Γ-circulation in is a mapping : such that, where, and : tail in X, head in

Ψηφιακή ανάπτυξη. Course Unit #1 : Κατανοώντας τις βασικές σύγχρονες ψηφιακές αρχές Thematic Unit #1 : Τεχνολογίες Web και CMS

ΠΟΛΥΤΕΧΝΕΙΟ ΚΡΗΤΗΣ ΤΜΗΜΑ ΜΗΧΑΝΙΚΩΝ ΠΕΡΙΒΑΛΛΟΝΤΟΣ

Abstract Storage Devices

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 19/5/2007

ΠΕΡΙΕΧΟΜΕΝΑ. Κεφάλαιο 1: Κεφάλαιο 2: Κεφάλαιο 3:

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 6/5/2006

Οδηγίες χρήσης. Registered. Οδηγίες ένταξης σήματος D-U-N-S Registered στην ιστοσελίδα σας και χρήσης του στην ηλεκτρονική σας επικοινωνία

Matrices and vectors. Matrix and vector. a 11 a 12 a 1n a 21 a 22 a 2n A = b 1 b 2. b m. R m n, b = = ( a ij. a m1 a m2 a mn. def

encouraged to use the Version of Record that, when published, will replace this version. The most /BCJ BIOCHEMICAL JOURNAL

Queensland University of Technology Transport Data Analysis and Modeling Methodologies

Srednicki Chapter 55

Lecture 8: Serial Correlation. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II

Project: 296 File: Title: CMC-E-600 ICD Doc No: Rev 2. Revision Date: 15 September 2010

Buried Markov Model Pairwise

ΑΥΤΟΜΑΤΟΠΟΙΗΣΗ ΜΟΝΑΔΑΣ ΘΡΑΥΣΤΗΡΑ ΜΕ ΧΡΗΣΗ P.L.C. AUTOMATION OF A CRUSHER MODULE USING P.L.C.

Architecture οf Integrated Ιnformation Systems (ARIS)

519.22(07.07) 78 : ( ) /.. ; c (07.07) , , 2008

Section 8.3 Trigonometric Equations

[2] T.S.G. Peiris and R.O. Thattil, An Alternative Model to Estimate Solar Radiation

FINAL TEST B TERM-JUNIOR B STARTING STEPS IN GRAMMAR UNITS 8-17

ΠΑΡΑΜΕΤΡΟΙ ΕΠΗΡΕΑΣΜΟΥ ΤΗΣ ΑΝΑΓΝΩΣΗΣ- ΑΠΟΚΩΔΙΚΟΠΟΙΗΣΗΣ ΤΗΣ BRAILLE ΑΠΟ ΑΤΟΜΑ ΜΕ ΤΥΦΛΩΣΗ

Δίκτυα Δακτυλίου. Token Ring - Polling

C.S. 430 Assignment 6, Sample Solutions

Appendix A3. Table A3.1. General linear model results for RMSE under the unconditional model. Source DF SS Mean Square

VBA ΣΤΟ WORD. 1. Συχνά, όταν ήθελα να δώσω ένα φυλλάδιο εργασίας με ασκήσεις στους μαθητές έκανα το εξής: Version ΗΜΙΤΕΛΗΣ!!!!

Summary of the model specified

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ. «Θεσμικό Πλαίσιο Φωτοβολταïκών Συστημάτων- Βέλτιστη Απόδοση Μέσω Τρόπων Στήριξης»

Matrices and Determinants

Study of urban housing development projects: The general planning of Alexandria City

derivation of the Laplacian from rectangular to spherical coordinates

ΓΗΠΛΧΜΑΣΗΚΖ ΔΡΓΑΗΑ ΑΡΥΗΣΔΚΣΟΝΗΚΖ ΣΧΝ ΓΔΦΤΡΧΝ ΑΠΟ ΑΠΟΦΖ ΜΟΡΦΟΛΟΓΗΑ ΚΑΗ ΑΗΘΖΣΗΚΖ

Εργαστήριο Ανάπτυξης Εφαρμογών Βάσεων Δεδομένων. Εξάμηνο 7 ο

EE512: Error Control Coding

ΠΑΝΕΠΙΣΤΗΜΙΟ ΠΑΤΡΩΝ ΤΜΗΜΑ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΤΕΧΝΟΛΟΓΙΑΣ ΥΠΟΛΟΓΙΣΤΩΝ ΤΟΜΕΑΣ ΣΥΣΤΗΜΑΤΩΝ ΗΛΕΚΤΡΙΚΗΣ ΕΝΕΡΓΕΙΑΣ

Πανεπιστήµιο Μακεδονίας Οικονοµικών και Κοινωνικών Επιστηµών Τµήµα Εφαρµοσµένης Πληροφορικής

ΚΑΘΟΡΙΣΜΟΣ ΠΑΡΑΓΟΝΤΩΝ ΠΟΥ ΕΠΗΡΕΑΖΟΥΝ ΤΗΝ ΠΑΡΑΓΟΜΕΝΗ ΙΣΧΥ ΣΕ Φ/Β ΠΑΡΚΟ 80KWp

Δίκτυα Επικοινωνιών ΙΙ: OSPF Configuration

Μηχανισμοί πρόβλεψης προσήμων σε προσημασμένα μοντέλα κοινωνικών δικτύων ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ

Οδηγίες χρήσης υλικού D U N S Registered

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ Σχολή Πολιτικών Μηχανικών Τοµέας οµοστατικής ΑΛΛΗΛΕΠΙ ΡΑΣΗ ΑΣΤΟΧΙΑΣ ΑΠΟ ΛΥΓΙΣΜΟ ΚΑΙ ΠΛΑΣΤΙΚΟΠΟΙΗΣΗ ΣΕ ΜΕΤΑΛΛΙΚΑ ΠΛΑΙΣΙΑ

ΜΑΘΗΜΑΤΙΚΗ ΠΡΟΣΟΜΟΙΩΣΗ ΤΗΣ ΔΥΝΑΜΙΚΗΣ ΤΟΥ ΕΔΑΦΙΚΟΥ ΝΕΡΟΥ ΣΤΗΝ ΠΕΡΙΠΤΩΣΗ ΑΡΔΕΥΣΗΣ ΜΕ ΥΠΟΓΕΙΟΥΣ ΣΤΑΛΑΚΤΗΦΟΡΟΥΣ ΣΩΛΗΝΕΣ ΣΕ ΔΙΑΣΤΡΩΜΕΝΑ ΕΔΑΦΗ

ΕΠΛ221: Οργάνωση Υπολογιστών και Συμβολικός Προγραμματισμός. Εργαστήριο Αρ. 2

Hancock. Ζωγραφάκης Ιωάννης Εξαρχάκος Νικόλαος. ΕΠΛ 428 Προγραμματισμός Συστημάτων

HW 3 Solutions 1. a) I use the auto.arima R function to search over models using AIC and decide on an ARMA(3,1)

Transcript:

WorkshopinMethods PrinciplesofWorkflowin DataAnalysis ScottLong IndianaUniversity November18,2009 Workflow involves the entire craft of data analysis 1.Planning,organizinganddocumentingresearch 2.Cleaningdataandcreatingvariables 3.Estimatingmodelsandcreatinggraphs 4.Presentingandpublishingfindings 5.Archivingandbackingupmaterials 1.YourWFmightbe A.Plannedandcarefullyorchestrated. B.Adhoc,piecemeal,developedinreactiontomistakes. C.Good,bad,orugly. 2.AlmostcertainlyyoucanimproveyourWFwithamodestinvestmentoftime. A.Thelessexperienceyouhave,theeasieritis! B.Itwillsaveyoutimeandmakeyouabetterdataanalyst.

1.Replication o Agoodworkflowisessentialforreplication. o Replicationisessentialforgoodscience. 2.Gettingtherightanswers o Retractionsareembarrassingandcanendcareers. 3.Time o Scienceisavoraciousinstitution. o Boringthingsshouldtakeaslittletimeaspossible. 4.TheIUadvantage Thepublicationof[WFDAUS]mayevenreduceIndiana scomparative advantageofproducinghotshotquantphdsnowthatgradstudents elsewherecanvicariouslybenefitfromthisimportantaspectofthe trainingthere. GabrielRossmanonhisblog Spending my time.. 1.Fixingeasythings:timeconsultingoneasythings,nothardthings. 2.Lookingatincorrectresultsandclever explanations. 3.Adissertationdelayed18monthstofindwhyresultschanged. 4.Irreproducibleresultsfromasingle,743linedofile. 5.Analyzingthewrongdataset: Thedatasetsareexactlythesameexcept thatichangedthemarriedvariable. 6.WastingtimeanalyzingthewrongvariablewhilewritinganNASreport. 7.Miscodedgenesthedelayedprogressonastudyofalcholism. 8.Collaborationsthatmultiplythewaysthingscangowrong. 9.Misleadingorambiguousoutputsuchas... Example 1: definitely not good!

Example 2: which number is which? Example 3: good software doing things badly Page963ofStata11manual. use margex (Artificial data for margins). regress y i.sex i.group Source SS df MS Number of obs = 3000 -------------+------------------------------ F( 3, 2996) = 152.06 Model 183866.077 3 61288.6923 Prob > F = 0.0000 Residual 1207566.93 2996 403.059723 R-squared = 0.1321 -------------+------------------------------ Adj R-squared = 0.1313 Total 1391433.01 2999 463.965657 Root MSE = 20.076 ------------------------------------------------------------------------------ y Coef. Std. Err. t P> t [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.sex 18.32202.8930951 20.52 0.000 16.57088 20.07316 group 2 8.037615.913769 8.80 0.000 6.245937 9.829293 3 18.63922 1.159503 16.08 0.000 16.36572 20.91272 _cons 53.32146.9345465 57.06 0.000 51.48904 55.15388 ------------------------------------------------------------------------------. margins sex Predictive margins Number of obs = 3000 Model VCE : OLS Expression : Linear prediction, predict() ------------------------------------------------------------------------------ Delta-method Margin Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- sex 0 60.56034.5781782 104.74 0.000 59.42713 61.69355 1 78.88236.5772578 136.65 0.000 77.75096 80.01377 ------------------------------------------------------------------------------ Thenumbersreportedinthe Margin columnareaveragevaluesofy. Basedonalinearregressionofyonsexandgroup,60.6wouldbethe averagevalueofyifeveryoneinthedataweretreatedasiftheywere sex=0,and78.9wouldbetheaveragevalueifeveryoneweretreatedasif theyweresex=1.

2004 2005(1stEdition) Workflow,notslow.BruceFraser Thenamewasn tacoincidence. Itrequires: 1.masteringtacitknowledge 2.lotsofheavylifting 1.Explicitknowledgeisthestuffoftextbooksandarticles. 2.Tacitknowledgeisimplicitandundocumented(MichaelPolanyi). 3.Peoplecanbeunawareoftheirtacitknowledgeandhowvaluableitis. o HenryBessemer spatentformakingsteel(1855) 4.Tacitknowledgeistransferredthroughpersonalcontact. o WFisacraftandcraftsarelearned atthebench. o Personalcomputersimpedethistransferofthecraft. There is a lot of heavy lifting in doing data analysis well. Thereality,ofcourse,todayisthatifyoucomeupwithagreatideayou don'tgettogoquicklytoasuccessfulproduct.there'salotof undifferentiatedheavyliftingthatstandsbetweenyourideaandthat success. WorkflowofDataAnalysisUsingStata 1.MakestacitknowledgeaboutWFexplicit. 2.Itdealswithalotofundifferentiatedheavylifting. 3.Itcontainsspecificsonthegeneralissuesdicussedtoday

replication 1.SciencedemandsreplicabilityandagoodWFfacilitatesreplication. 2.Anticipatetheneedtoreplicatefromthestart,notafteryourworkhas beenchallenged. 3.Manydisciplinesareworriedaboutreplicability. o ArticlesinPoliticalScience,Economics,Sociologyandotherfields. 4.Callsforjournalstodepositallanalysesaregrowing. o Areyourdofilesandlogfilesreadyforpublicdisplay? 1.Thecurseofdimensionality:10minordecisions,leadsto1,024reasonable waystocreateyourdata. o Adecisionwheretotruncateavariable. o ThatpeskyseedfortheRNgenerator. o Howtohandlemissingdataina5variablescale. o Decisionsonwhichcasestokeepforanalysis. o Andsoon... Decisions in the path to analysis: the choices that could be made

Decisions in the path to analysis: the choices made 2.Lackofdocumentation:Replicationshouldinvolveretrieving documentation,nottryingtorememberwhatyoudid. 3.Changingsoftware:2weeksofsleeplessnightsduetoversionvariation. 4.Lostfiles:corrupted,lost,unreadable,orambiguousfiles. 5.Mistakesaremade.Theyareunavoidable,butagoodWFcanhelpyou catchandfixthem. ironicaloptimism Theuniversalaptitudeforineptitudemakesanyhumanaccomplishment anincrediblemiracle.dr.johnpaulstapp

Steps o Thedatamustbeaccurate. o Thevariablesmustbecarefullynamedandlabeled. o Thistakes90%ofthetime,unlessyouhurry. o Estimatingmodelsandcomputinggraphs. o Thisisoftenthesimplestpartoftheworkflow. o Incorporatingoutputintoyourpresentation. o Makingaclearpresentation. o Maintainingtheprovenanceofresults. o Backingupandarchiving:preservingthebitsversusmaintainingthe information. o "Today'snoiseistomorrow'sknowledge."DavidClemmer o Replicationisimpossiblewithoutthedataanddofiles. Tasks

Tasks Tasks Tasks

Tasks Tasks The ideal BlauandDuncan(1967)TheAmericanOccupationalStructure:Allanalyseswere specified9monthsbeforeoutputwasreceived.then,thebookwaswritten basedentirelyonthoseanalyses.

Issues in planning 1.Alittleplanninggoesalongwayandalmostalwayssavestime. 2.Aplanisaremindertostayontrack,finishtheproject,andpublishresults. Work.Finish.Publish.MichaelFaraday ssigninhislab 3.Mostpeoplepreferto dosomething ratherthanplan. 4.Planningincludes: a.generalgoalsandpublishingplans b.scheduling c.divisionoflabor d.datasetsneeded e.variablenamesandlabels f.missingdataprocedures g.analysis h.documentation i. Backingupandarchivingmaterials 1.Organizationisdrivenbyneedingtofindthingsandavoidduplication. 2.Carefulorganizationhelpsyouworkfaster. 3.Organizationrewards... A.Consistencyanduniformity B.Asimplestructurethatisnottoosimple. C.Thinkingsystematicallyabouthowyounameandstorethings. 4.Organizationiscontagious:startorganizedtoendorganized. 1.Youhavemultipleversionsofafileanddon'tknowwhichiswhich. 2.Youcan'tfindafileandthinkyoumighthavedeletedit. 3.Youandacolleaguearebothworkingondifferentversionsofthesame paper.youchangedwhatshechangedandnowyouhavethreeversionsof thepaper. 4.Youneedthefinalversionofthepaperthewassubmittedforreview,but youhavetwofileswith"final"inthename.

1.Itiseasiertocreateafilethantofindafile. 2.Itiseasiertofindafilethantoknowwhatisinthefile. 3.Withdiskspacesocheap,itistemptingtocreatealotoffiles. Organizing: a standard directory structure for all projects \WF project \- History \2009-03-06 project directory created \- Hold then delete \- Pre posted \- To clean \Documentation \Posted \Resources \Text \- Versions \Work \- To do Organizing: wfsetupsingle.bat makes it easy REM workflow talk 2 \ wfsetupsingle.bat jsl 2009-07-12 REM directory structure for single person. FOR /F "tokens=2,3,4 delims=/- " %%a in ("%DATE%") do set CDATE=%%c-%%a-%%b md "- History\%cdate% project directory created" md "- Hold then delete " md "- Pre posted " md "- To clean" md "Documentation" md "Posted" md "Resources" md "Text\- Versions\" md "Work\- To do"

capture log close log using wftalk-example, replace text // program: wftalk-example.do // task: alt-1 creates this file // project: wftalk example // author: jsl \ 2009-07-23 // #0 // program setup version 11 clear all set linesize 80 * matrix drop _all local tag "wftalk-example.do jsl 2009-07-23" // #1 // <task description> // #2 // <task description> log close exit 1.Long'sLaw:Itisalwaysfastertodocumentittodaythantomorrow. Corollary1:Nobodylikestowritedocumentation. Corollary2:Nobodyregretshavingwrittendocumentation. Haveyoueversaid:"Drat,thisprogramhastoomanycomments." 2.Withoutdocumentation,replicationisvirtuallyimpossible,mistakesare morelikely,andworktakeslonger. 3.Themorecodifiedthefieldthegreatertheemphasisondocumentation. A.TheResearchLogbytheACS B.Lossoftenureforanalteredresearchlog. 4.Documentationoccursatmanylevels:logs,metadata,comments,names. 1.Itisfastertodocumentittodaythantomorrow. 2.Tokeepupwithdocumentation,tieittoeventsinyourwork. 3.Writeittoday;checkitlater. 4.Includefulldatesandnames. Arealexample(expletive sdeleted)...

1.Executioninvolvescarryingoutspecifictaskswithineachstep. 2.Effectiveexecutionrequirestherighttoolsforthejob. o Software Texteditor Filemanager Statisticalsoftware Wordprocessor Macroprogram o Hardware Display Storage Centralprocessor The key an effective workflow is planning, not computing. 1.Randomlydivideyourselvesintotwogroups. 2.Thecomputersareallowedtocomputewheneveryoulikewhenwriting yourdissertation. 3.Theplannershaveaccesstoacomputerfortwosixhoursessionsaweek. 4.Thewager:theplannersfinishfirst. Does cheap computing help? Thehistoricalcontextofcomputing...

Cornell 1975: the entire computing infrastructure IBM370with240Kmemory Winchesterdriveswith3MBstorage Costofcomputing$1,000,000. Meantimetodegree7.6years. Indiana 2009: a disposable PC Asus1000HEwith2GBmemory FreeAgentwith1TBstorage 10,000timesmore 350,000timesmore... Costofcomputing$400(2,500timesless). Meantimetodegree7.6years. Criteria o Ifyourprogramisnotcorrect,thennothingelsematters. OliveiraandStewart o Completeworkquicklygivenaccuracyandreplicability. o Tensionbetweenworkingquicklyandworkingcarefully. othemorecomplicatedyourproceduresthemorelikelyyouwillmake mistakesorabandonyourplan.

o Youdon'thavetorepeatedlydecidehowtodothings. o Withstandardization,itiseasiertofindmistakes. o Automatedprocedurespreventmistakes. o Drukker'sDictim:Nevertypeanythingthatyoucanobtainfromasaved result. o Yourworkflowshouldreflectthewayyou liketowork. o IfyouignoreyourWFplan,itisnotagoodWF. o Differentprojectsmightrequiredifferentworkflows. Dualworkflowrunorder 1.Dualworkflow:keepdatamanagementanddataanalysisseparate. 2.Runorder:namefilessothatiftheyarereruninalphabeticalorder,youwill produceexactlythesameresults. 3.Postingprinciplewithtworules a.thesharerule:onlyshareresultsaftertheassociatedfilesareposted. b.thenochangerule:onceafileisposted,neverchangeit. Datamanagement==> <==Dataanalysis

Files from a dual workflow Datamanagement Dataanalysis data01.do stat01a.do data02v2.do stat01b.do data03.do stat01cv2.do data03-1.do data03-2.do stat02a.do data04.do stat02a1.do stat02b.do stat03av2.do stat03b.do stat03c.do stat03c1.do stat03c2v2.do stat03d.do postingprinciple 1.Postingsimplymeanssavingfilesasfinal. 2.Thepostingprincipleinvolvestworules a.thesharerule:onlyshareresultsaftertheassociatedfilesareposted. b.thenochangerule:onceafileisposted,neverchangeit 3.Neverdoesnotmean: a.rarelydone. b.justalittlechange. c.reallysoonafterposting. 1.Selfcontained 2.Versioncontrol 3.Excludedirectoryinformation 4.Includeseedsforrandomnumbers 5.Archiveuserwrittenadofiles 1.Lotsofcomments;evenmorethanthat! 2.Alignmentandindentation 3.Shortlineswithnowrapping 4.Avoidabbreviations: l a l in 1/3

+----------------+ Key ---------------- frequency row percentage +----------------+ Years of education Occupation 3 6 7 8 9 10 11 12 13 Total -----------+---------------------------------------------------------------------------- -----------------------+---------- Menial 0 2 0 0 3 1 3 12 2 31 0.00 6.45 0.00 0.00 9.68 3.23 9.68 38.71 6.45 100.00 -----------+---------------------------------------------------------------------------- -----------------------+---------- BlueCol 1 3 1 7 4 6 5 26 7 69 1.45 4.35 1.45 10.14 5.80 8.70 7.25 37.68 10.14 100.00 -----------+---------------------------------------------------------------------------- -----------------------+---------- Craft 0 3 2 3 2 2 7 39 7 84 0.00 3.57 2.38 3.57 2.38 2.38 8.33 46.43 8.33 100.00 -----------+---------------------------------------------------------------------------- -----------------------+---------- WhiteCol 0 0 0 1 0 1 2 19 4 41 0.00 0.00 0.00 2.44 0.00 2.44 4.88 46.34 9.76 100.00 -----------+---------------------------------------------------------------------------- -----------------------+---------- Prof 0 0 1 1 0 0 2 13 10 112 0.00 0.00 0.89 0.89 0.00 0.00 1.79 11.61 8.93 100.00 -----------+---------------------------------------------------------------------------- -----------------------+---------- Total 1 8 4 12 9 10 19 109 30 337 0.30 2.37 1.19 3.56 2.67 2.97 5.64 32.34 8.90 100.00 Years of education Occupation 3 6 7 8 9 10 11 12 13 Total -----------+------------------------------------------------------------------- ------------------------------+---------- Menial 0 2 0 0 3 1 3 12 2 31 0.00 6.45 0.00 0.00 9.68 3.23 9.68 38.71 6.45 100.00 -----------+------------------------------------------------------------------- ------------------------------+---------- BlueCol 1 3 1 7 4 6 5 26 7 69 1.45 4.35 1.45 10.14 5.80 8.70 7.25 37.68 10.14 100.00 -----------+------------------------------------------------------------------- ------------------------------+---------- Craft 0 3 2 3 2 2 7 39 7 84 0.00 3.57 2.38 3.57 2.38 2.38 8.33 46.43 8.33 100.00 -----------+------------------------------------------------------------------- ------------------------------+----------

Muchofyourworkinvolves repetitivetasksthatinvite error.automationmakes workeasier,faster,less errorprone. 1.macros 2.loops 3.returnedresults 4.matrices 5.adofiles 6.includefiles 7.helpmefiles 1.Lowlevelsofautomationanddocumentation wf-automation-1low.do 2.Mediumlevelsofautomationanddocumentation wf-automation-2medium.do 3.Highlevelsofautomationanddocumentation wf-automation-3high.do+wf-automation-3high.doi wf-automation-3highv2.do + wf-automation-3highv2.doi Example:ownsexandownsexucausedweeksofconfusion.

Cleaning 1: finding an error with a graph

Cleaning 1: reversing the graph Cleaning 2: remembering a coding decision Cleaning 3: understanding the substantive process

Cleaning 4: avoiding expensive mistakes 1. Takelotsofclassesinstatistics. 2. Findexemplars;don tdo yourway. 1.Contentandmethodsaresubstantive,disciplinarydecisions. 2.Presentationsandpreservationofprovenancearemoregeneric. mlogit (N=337): Factor Change in the Odds of occ Variable: white (sd=.27642268) Odds comparing Alternative 1 to Alternative 2 b z P> z e^b e^bstdx ------------------+--------------------------------------------- Menial -BlueCol -1.23650-1.707 0.088 0.2904 0.7105 Menial -Craft -0.47234-0.782 0.434 0.6235 0.8776 Menial -WhiteCol -1.57139-1.741 0.082 0.2078 0.6477 Menial -Prof -1.77431-2.350 0.019 0.1696 0.6123 BlueCol -Menial 1.23650 1.707 0.088 3.4436 1.4075 BlueCol -Craft 0.76416 1.208 0.227 2.1472 1.2352 BlueCol -WhiteCol -0.33488-0.359 0.720 0.7154 0.9116 BlueCol -Prof -0.53780-0.673 0.501 0.5840 0.8619 Craft -Menial 0.47234 0.782 0.434 1.6037 1.1395 Craft -BlueCol -0.76416-1.208 0.227 0.4657 0.8096 Craft -WhiteCol -1.09904-1.343 0.179 0.3332 0.7380 Craft -Prof -1.30196-2.011 0.044 0.2720 0.6978 WhiteCol-Menial 1.57139 1.741 0.082 4.8133 1.5440 WhiteCol-BlueCol 0.33488 0.359 0.720 1.3978 1.0970 WhiteCol-Craft 1.09904 1.343 0.179 3.0013 1.3550 WhiteCol-Prof -0.20292-0.233 0.815 0.8163 0.9455 Prof -Menial 1.77431 2.350 0.019 5.8962 1.6331 Prof -BlueCol 0.53780 0.673 0.501 1.7122 1.1603 Prof -Craft 1.30196 2.011 0.044 3.6765 1.4332 Prof -WhiteCol 0.20292 0.233 0.815 1.2250 1.0577 ---------------------------------------------------------------- Variable: ed (sd=2.9464271) Odds comparing Alternative 1 to Alternative 2 b z P> z e^b e^bstdx ------------------+--------------------------------------------- Menial -BlueCol 0.09942 0.972 0.331 1.1045 1.3404 Menial -Craft -0.09382-0.962 0.336 0.9105 0.7585 Menial -WhiteCol -0.35316-3.011 0.003 0.7025 0.3533 Menial -Prof -0.77885-6.795 0.000 0.4589 0.1008 BlueCol -Menial -0.09942-0.972 0.331 0.9054 0.7461 BlueCol -Craft -0.19324-2.494 0.013 0.8243 0.5659 BlueCol -WhiteCol -0.45258-4.425 0.000 0.6360 0.2636 BlueCol -Prof -0.87828-8.735 0.000 0.4155 0.0752 Craft -Menial 0.09382 0.962 0.336 1.0984 1.3184 Craft -BlueCol 0.19324 2.494 0.013 1.2132 1.7671 Craft -WhiteCol -0.25934-2.773 0.006 0.7716 0.4657 Craft -Prof -0.68504-7.671 0.000 0.5041 0.1329 WhiteCol-Menial 0.35316 3.011 0.003 1.4236 2.8308 WhiteCol-BlueCol 0.45258 4.425 0.000 1.5724 3.7943 WhiteCol-Craft 0.25934 2.773 0.006 1.2961 2.1471 WhiteCol-Prof -0.42569-4.616 0.000 0.6533 0.2853 Prof -Menial 0.77885 6.795 0.000 2.1790 9.9228 Prof -BlueCol 0.87828 8.735 0.000 2.4067 13.3002 Prof -Craft 0.68504 7.671 0.000 1.9838 7.5264 Prof -WhiteCol 0.42569 4.616 0.000 1.5307 3.5053 ---------------------------------------------------------------- Variable: exper (sd=13.959364)

ThecircledtextcontainsresultsImayneedtoconfirmlater: Turningon"show/hide "revealstheprovenance:

twoway (line art_root2 art_root3 art_root4 art_root5 articles, /// lwidth(medium)), ytitle(number of Publications to the k-th Root) /// yscale(range(0 8.)) legend(pos(11) rows(4) ring(0)) /// caption(wf7-caption.do \ jsl 2008-04-09, size(vsmall)) Whenitcomestosavingyourwork,expectthingstogowrong,expectthatyou willdeletethewrongfileattheworstpossibletime,andexpectahosetobeleft onintheroomaboveyourcomputer.ifyouexpecttheworst,youmightbeable topreventit. 1.KennedyassassinationonNovember22,1963andthe9/11survey. 2.508KvolumesinobsoleteformatsatBritishMuseum.2MvideosatIU. 3.NeilArmstrong'swalkonthemoononJuly20,1969,thelostmoontapes, andpinkfloyd'sdarksideofthemoon. "afuzzygrayblobwadingthroughaninkwell" DarkSideoftheMoon

Acomicallyinvolved,complicatedinvention,laboriouslycontrivedtoperforma simpleoperation. Typeoffile Questionstoconsider Working Posted Archival Howdoyourecoverthefile? Redo Redo Download work oldwork file Whatisthecostofrecovery? Minor Potentially Alittle lotsofwork time Howlongareyourpreservingit? 13yrs. 310yrs. Forever Howdifficultisittopreserve? Trivial Tedious Veryhard Concernwithmedia/format? Minor Some Critical

Mozyandsimilarvendors,corporatemassstorage,localservers. 1.Sizeperdriveincreasedbyafactorofmorethan300,000. 2.Costpergigabytedecreasedbyafactorof7,000,000. 3.AshoeboxfullofportabledrivescanholdenoughIBMcardstofillBHsix timesover.inthepast3months,thischangedto12timesover...

1.Slowly. 2.Finishthelast5%ofthechange. 3.LikePennandTeller,masterafewcooltricks. 4.Don'tdoitunderdeadline. 1.Therearemanyviableworkflows. 2.ThekeyadvantageoftheWFbookisthatitiswrittendown. 3.AlanAcockwrote: o Noteveryonewillagreewithallof[Long's]suggestions. o IwillposttheannouncementofWorkflowonmydoorwiththefollowing note: I mgladtohelpanybodywhofollowedatleast25%oftheadvice Longprovides andbringsmetheirdofiles! 4.DoyoureallywanttospendyourtimerediscoveringthemistakesImade? Provenance:20090710;20090712;20090723;20090727;20090728;20090729;20090731;20091112;20091118