Principles of Workflow in Data Analysis



Σχετικά έγγραφα
Principles of Workflow in Data Analysis

Comparing groups in regression models for binary outcomes

Test Data Management in Practice

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 6/5/2006

Προσομοίωση BP με το Bizagi Modeler

Numerical Analysis FMN011

The challenges of non-stable predicates

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 19/5/2007

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 24/3/2007

ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ "ΠΟΛΥΚΡΙΤΗΡΙΑ ΣΥΣΤΗΜΑΤΑ ΛΗΨΗΣ ΑΠΟΦΑΣΕΩΝ. Η ΠΕΡΙΠΤΩΣΗ ΤΗΣ ΕΠΙΛΟΓΗΣ ΑΣΦΑΛΙΣΤΗΡΙΟΥ ΣΥΜΒΟΛΑΙΟΥ ΥΓΕΙΑΣ "

ΣΤΥΛΙΑΝΟΥ ΣΟΦΙΑ

MATHACHij = γ00 + u0j + rij

Εγκατάσταση λογισμικού και αναβάθμιση συσκευής Device software installation and software upgrade

ΑΡΙΣΤΟΤΕΛΕΙΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΟΝΙΚΗΣ

Assalamu `alaikum wr. wb.

Πανεπιστήμιο Πειραιώς Τμήμα Πληροφορικής Πρόγραμμα Μεταπτυχιακών Σπουδών «Πληροφορική»

EE512: Error Control Coding

Στο εργαστήριο θα μελετηθούν: Διδάσκων: Γιώργος Χατζηπολλάς. Εργαστήριο 2: Εργαλεία Συστήματος UNIX. Ομάδες για παρουσίαση

Study of urban housing development projects: The general planning of Alexandria City

5.4 The Poisson Distribution.

SPEEDO AQUABEAT. Specially Designed for Aquatic Athletes and Active People

ΠΟΛΥΤΕΧΝΕΙΟ ΚΡΗΤΗΣ ΤΜΗΜΑ ΜΗΧΑΝΙΚΩΝ ΠΕΡΙΒΑΛΛΟΝΤΟΣ

The Simply Typed Lambda Calculus

Architecture οf Integrated Ιnformation Systems (ARIS)

Advanced Subsidiary Unit 1: Understanding and Written Response

ΤΕΛΙΚΗ ΕΞΕΤΑΣΗ ΔΕΙΓΜΑ ΟΙΚΟΝΟΜΕΤΡΙΑ Ι (3ο Εξάμηνο) Όνομα εξεταζόμενου: Α.Α. Οικονομικό Πανεπιστήμιο Αθήνας -- Τμήμα ΔΕΟΣ Καθηγητής: Γιάννης Μπίλιας

Μηχανική Μάθηση Hypothesis Testing

ΜΟΝΤΕΛΑ ΛΗΨΗΣ ΑΠΟΦΑΣΕΩΝ

ΤΟ ΣΤΑΥΡΟΔΡΟΜΙ ΤΟΥ ΝΟΤΟΥ ΤΟ ΛΙΜΑΝΙ ΤΗΣ ΚΑΛΑΜΑΤΑΣ

Τμήμα Πολιτικών και Δομικών Έργων

Τ.Ε.Ι. ΔΥΤΙΚΗΣ ΜΑΚΕΔΟΝΙΑΣ ΠΑΡΑΡΤΗΜΑ ΚΑΣΤΟΡΙΑΣ ΤΜΗΜΑ ΔΗΜΟΣΙΩΝ ΣΧΕΣΕΩΝ & ΕΠΙΚΟΙΝΩΝΙΑΣ

ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ "ΖΗΤΗΜΑΤΑ ΠΟΥ ΑΝΤΙΜΕΤΩΠΙΖΟΥΝ ΟΙ ΓΥΝΑΙΚΕΣ ΕΡΓΑΖΟΜΕΝΕΣ ΣΤΟ ΣΥΓΧΡΟΝΟ ΕΠΙΧΕΙΡΗΣΙΑΚΟ ΠΕΡΙΒΑΛΛΟΝ"

Οι αδελφοί Montgolfier: Ψηφιακή αφήγηση The Montgolfier Βrothers Digital Story (προτείνεται να διδαχθεί στο Unit 4, Lesson 3, Αγγλικά Στ Δημοτικού)

ΣΧΕΔΙΑΣΜΟΣ ΕΠΙΓΕΙΟΥ ΣΥΣΤΗΜΑΤΟΣ ΑΛΥΣΟΚΙΝΗΣΗΣ ΓΙΑ ΜΕΤΑΦΟΡΑ ΤΡΟΛΕΪ

derivation of the Laplacian from rectangular to spherical coordinates

Queensland University of Technology Transport Data Analysis and Modeling Methodologies

Nowhere-zero flows Let be a digraph, Abelian group. A Γ-circulation in is a mapping : such that, where, and : tail in X, head in

ΕΘΝΙΚΗ ΣΧΟΛΗ ΗΜΟΣΙΑΣ ΙΟΙΚΗΣΗΣ

Example of the Baum-Welch Algorithm

C.S. 430 Assignment 6, Sample Solutions

ΠΑΡΑΜΕΤΡΟΙ ΕΠΗΡΕΑΣΜΟΥ ΤΗΣ ΑΝΑΓΝΩΣΗΣ- ΑΠΟΚΩΔΙΚΟΠΟΙΗΣΗΣ ΤΗΣ BRAILLE ΑΠΟ ΑΤΟΜΑ ΜΕ ΤΥΦΛΩΣΗ

ΣΧΕΔΙΑΣΜΟΣ ΔΙΚΤΥΩΝ ΔΙΑΝΟΜΗΣ. Η εργασία υποβάλλεται για τη μερική κάλυψη των απαιτήσεων με στόχο. την απόκτηση του διπλώματος

SCITECH Volume 13, Issue 2 RESEARCH ORGANISATION Published online: March 29, 2018

BUSINESS PLAN (Επιχειρηματικό σχέδιο)

TaxiCounter Android App. Περδίκης Ανδρέας ME10069

1 (forward modeling) 2 (data-driven modeling) e- Quest EnergyPlus DeST 1.1. {X t } ARMA. S.Sp. Pappas [4]

!"!"!!#" $ "# % #" & #" '##' #!( #")*(+&#!', & - #% '##' #( &2(!%#(345#" 6##7

Keystroke-Level Model

Reminders: linear functions

Elements of Information Theory

Section 8.3 Trigonometric Equations

Χαρτογράφηση θορύβου

Πανεπιστήµιο Μακεδονίας Τµήµα ιεθνών Ευρωπαϊκών Σπουδών Πρόγραµµα Μεταπτυχιακών Σπουδών στις Ευρωπαϊκές Πολιτικές της Νεολαίας

ΓΗΠΛΧΜΑΣΗΚΖ ΔΡΓΑΗΑ ΑΡΥΗΣΔΚΣΟΝΗΚΖ ΣΧΝ ΓΔΦΤΡΧΝ ΑΠΟ ΑΠΟΦΖ ΜΟΡΦΟΛΟΓΗΑ ΚΑΗ ΑΗΘΖΣΗΚΖ

ΑΓΓΛΙΚΑ Ι. Ενότητα 7α: Impact of the Internet on Economic Education. Ζωή Κανταρίδου Τμήμα Εφαρμοσμένης Πληροφορικής

PHOS π 0 analysis, for production, R AA, and Flow analysis, LHC11h

Other Test Constructions: Likelihood Ratio & Bayes Tests

Πρόβλημα 1: Αναζήτηση Ελάχιστης/Μέγιστης Τιμής

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΓΕΩΠΟΝΙΚΩΝ ΕΠΙΣΤΗΜΩΝ ΚΑΙ ΕΠΙΣΤΗΜΗΣ ΚΑΙ ΤΕΧΝΟΛΟΓΙΑΣ ΠΕΡΙΒΑΛΛΟΝΤΟΣ. Πτυχιακή εργασία

Πανεπιστήμιο Δυτικής Μακεδονίας. Τμήμα Μηχανικών Πληροφορικής & Τηλεπικοινωνιών. Ηλεκτρονική Υγεία

Εργαστήριο Ανάπτυξης Εφαρμογών Βάσεων Δεδομένων. Εξάμηνο 7 ο

ΚΑΘΟΡΙΣΜΟΣ ΠΑΡΑΓΟΝΤΩΝ ΠΟΥ ΕΠΗΡΕΑΖΟΥΝ ΤΗΝ ΠΑΡΑΓΟΜΕΝΗ ΙΣΧΥ ΣΕ Φ/Β ΠΑΡΚΟ 80KWp

IIT JEE (2013) (Trigonomtery 1) Solutions

Δίκτυα Επικοινωνιών ΙΙ: OSPF Configuration

ΟΙΚΟΝΟΜΟΤΕΧΝΙΚΗ ΑΝΑΛΥΣΗ ΕΝΟΣ ΕΝΕΡΓΕΙΑΚΑ ΑΥΤΟΝΟΜΟΥ ΝΗΣΙΟΥ ΜΕ Α.Π.Ε

ΣΟΡΟΠΤΙΜΙΣΤΡΙΕΣ ΕΛΛΗΝΙΔΕΣ

CHAPTER 25 SOLVING EQUATIONS BY ITERATIVE METHODS

Σημασιολογικός Ιστός (Semantic Web) - XML

Example Sheet 3 Solutions

Partial Trace and Partial Transpose

VBA ΣΤΟ WORD. 1. Συχνά, όταν ήθελα να δώσω ένα φυλλάδιο εργασίας με ασκήσεις στους μαθητές έκανα το εξής: Version ΗΜΙΤΕΛΗΣ!!!!

ΕΙΣΑΓΩΓΗ ΣΤΗ ΧΡΗΣΗ ΤΟΥ ΣΤΑΤΙΣΤΙΚΟΥ ΠΑΚΕΤΟΥ MINITAB 12.0

ΔΙΑΤΜΗΜΑΤΙΚΟ ΠΡΟΓΡΑΜΜΑ ΜΕΤΑΠΤΥΧΙΑΚΩΝ ΣΠΟΥΔΩΝ ΣΤΗ ΔΙΟΙΚΗΣΗ ΕΠΙΧΕΙΡΗΣΕΩΝ ΘΕΜΕΛΙΩΔΗΣ ΚΛΑΔΙΚΗ ΑΝΑΛΥΣΗ ΤΩΝ ΕΙΣΗΓΜΕΝΩΝ ΕΠΙΧΕΙΡΗΣΕΩΝ ΤΗΣ ΕΛΛΗΝΙΚΗΣ ΑΓΟΡΑΣ

Stata Session 3. Tarjei Havnes. University of Oslo. Statistics Norway. ECON 4136, UiO, 2012

Οδηγίες χρήσης. Registered. Οδηγίες ένταξης σήματος D-U-N-S Registered στην ιστοσελίδα σας και χρήσης του στην ηλεκτρονική σας επικοινωνία

6.1. Dirac Equation. Hamiltonian. Dirac Eq.

ΤΕΧΝΙΚΕΣ ΔΙΑΓΝΩΣΗΣ ΤΗΣ ΝΟΣΟΥ ΑΛΤΣΧΑΙΜΕΡ ΜΕ FMRI

Calculating the propagation delay of coaxial cable

ΕΘΝΙΚΗ ΣΧΟΛΗ ΔΗΜΟΣΙΑΣ ΔΙΟΙΚΗΣΗΣ ΙΓ' ΕΚΠΑΙΔΕΥΤΙΚΗ ΣΕΙΡΑ

Δείκτες που πρέπει να παρακολουθεί ο ecommerce Manager ενός online Φαρμακείου. KPIs for Online Pharmacies

ΟΙ ΤΟΠΙΚΕΣ ΟΙΚΟΝΟΜΙΕΣ ΚΑΙ Η ΔΙΑΣΤΑΣΗ ΤΟΥ ΦΥΛΟΥ ΣΤΗΝ ΑΓΟΡΑ ΕΡΓΑΣΙΑΣ: Η ΠΕΡΙΠΤΩΣΗ ΤΟΥ ΝΟΜΟΥ ΜΕΣΣΗΝΙΑΣ

Fractional Colorings and Zykov Products of graphs

ΕΙΣΑΓΩΓΗ ΣΤΟN ΠΡΟΓΡΑΜΜΑΤΙΣΜΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΠΑΤΡΩΝ ΠΟΛΥΤΕΧΝΙΚΗ ΣΧΟΛΗ ΤΜΗΜΑ ΜΗΧΑΝΙΚΩΝ Η/Υ ΚΑΙ ΠΛΗΡΟΦΟΡΙΚΗΣ

ΓΕΩΜΕΣΡΙΚΗ ΣΕΚΜΗΡΙΩΗ ΣΟΤ ΙΕΡΟΤ ΝΑΟΤ ΣΟΤ ΣΙΜΙΟΤ ΣΑΤΡΟΤ ΣΟ ΠΕΛΕΝΔΡΙ ΣΗ ΚΤΠΡΟΤ ΜΕ ΕΦΑΡΜΟΓΗ ΑΤΣΟΜΑΣΟΠΟΙΗΜΕΝΟΤ ΤΣΗΜΑΣΟ ΨΗΦΙΑΚΗ ΦΩΣΟΓΡΑΜΜΕΣΡΙΑ

Το κοινωνικό στίγμα της ψυχικής ασθένειας

Η ΨΥΧΙΑΤΡΙΚΗ - ΨΥΧΟΛΟΓΙΚΗ ΠΡΑΓΜΑΤΟΓΝΩΜΟΣΥΝΗ ΣΤΗΝ ΠΟΙΝΙΚΗ ΔΙΚΗ

ΠΕΡΙΕΧΟΜΕΝΑ. Κεφάλαιο 1: Κεφάλαιο 2: Κεφάλαιο 3:

Inverse trigonometric functions & General Solution of Trigonometric Equations

Η ΑΥΛΗ ΤΟΥ ΣΧΟΛΕΙΟΥ ΠΑΙΧΝΙΔΙΑ ΣΤΗΝ ΑΥΛΗ ΤΟΥ ΣΧΟΛΕΙΟΥ

HOMEWORK#1. t E(x) = 1 λ = (b) Find the median lifetime of a randomly selected light bulb. Answer:

ΓΕΩΠΟΝΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΑΘΗΝΩΝ ΤΜΗΜΑ ΑΓΡΟΤΙΚΗΣ ΟΙΚΟΝΟΜΙΑΣ & ΑΝΑΠΤΥΞΗΣ

ΜΑΘΗΜΑΤΙΚΗ ΠΡΟΣΟΜΟΙΩΣΗ ΤΗΣ ΔΥΝΑΜΙΚΗΣ ΤΟΥ ΕΔΑΦΙΚΟΥ ΝΕΡΟΥ ΣΤΗΝ ΠΕΡΙΠΤΩΣΗ ΑΡΔΕΥΣΗΣ ΜΕ ΥΠΟΓΕΙΟΥΣ ΣΤΑΛΑΚΤΗΦΟΡΟΥΣ ΣΩΛΗΝΕΣ ΣΕ ΔΙΑΣΤΡΩΜΕΝΑ ΕΔΑΦΗ

Homework 3 Solutions

encouraged to use the Version of Record that, when published, will replace this version. The most /BCJ BIOCHEMICAL JOURNAL

Hancock. Ζωγραφάκης Ιωάννης Εξαρχάκος Νικόλαος. ΕΠΛ 428 Προγραμματισμός Συστημάτων

Physical DB Design. B-Trees Index files can become quite large for large main files Indices on index files are possible.

Srednicki Chapter 55

GAUSS-LAGUERRE AND GAUSS-HERMITE QUADRATURE ON 64, 96 AND 128 NODES

Οδηγός εκκαθάρισης ιστορικού cookies περιηγητή

Transcript:

IndianaUniversity PrinciplesofWorkflowin DataAnalysis ScottLong 1.Acoordinatedframeworkforconductingdataanalysis 2.WFinvolvescoordinatedproceduresfor: o Planning,organizinganddocumentingresearch o Cleaningdata o Analyzingdata o Presentingresults o Backingupandarchivingmaterials November2010 1.YourWFmightbe: A.Plannedandcarefullyorchestrated. B.Adhoc,piecemeal,developedinreactiontomistakes. 2.YoucanimproveyourWFwithamodestinvestmentoftime. A.Thelessexperienceyouhave,theeasieritis. B.Itwillsaveyoutimeandmakeyouabetterdataanalyst. 1.Replication o Replicationisessentialforgoodscience. o Aneffectiveworkflowisessentialforreplication. 2.Gettingtherightanswers o Retractionsareembarrassingandcanendcareers. 3.Time o Scienceisavoraciousinstitution. o Aneffectiveworkflowmakesyoumoreefficient. 4.Errorsareinevitable;aneffectiveworkflowhelpsyoufindandfixthem. 5.GainingtheIUadvantage Thepublicationof[TheWorkflowofDataAnalysis UsingStata]mayevenreduceIndiana scomparative advantageofproducinghotshotquantphdsnowthat gradstudentselsewherecanvicariouslybenefitfrom thisimportantaspectofthetrainingthere. Gabriel Rossmanonhisblog 1.Easythings:consultingoneasythings,insteadofhardthings. 2.Incorrectresultswithclever explanations. 3.Adissertationdelayed18monthstodeterminewhyresultschanged. 4.Irreproducibleresultsfromasingle,743linedofile. 5.Analyzingthewrongdataset: Thedatasetsareexactlythesameexcept thatichangedthemarriedvariable. 6.AnalyzingthewrongvariablewhilewritinganNASreport. 7.Miscodedgenesthatdelayedprogressinastudyofalcholism. 8.Collaborationsthatmultiplythewaysthingscangowrong. 9.Misleadingorambiguousoutputsuchas...

Example 1: definitely a problem in a $3M study. tabulate female sdchild_v1 R is Q15 Would let X care for children female? Defintel Probably Probably Definitel Total ----------+---------------------------------------------+---------- 0Male 41 99 155 197 492 1Female 73 98 156 215 542 ----------+---------------------------------------------+---------- Total 114 197 311 412 1,034 Example 2: which number is which?. tab occ ed, row Years of education Occupation 3 6 7 8 9 10 11 12 13 Total -----------+------------------------------------------------------------------------------- --------------------+---------- Menial 0 2 0 0 3 1 3 12 2 31 0.00 6.45 0.00 0.00 9.68 3.23 9.68 38.71 6.45 100.00 -----------+------------------------------------------------------------------------------- --------------------+---------- BlueCol 1 3 1 7 4 6 5 26 7 69 1.45 4.35 1.45 10.14 5.80 8.70 7.25 37.68 10.14 100.00 -----------+------------------------------------------------------------------------------- --------------------+---------- Craft 0 3 2 3 2 2 7 39 7 84 0.00 3.57 2.38 3.57 2.38 2.38 8.33 46.43 8.33 100.00 -----------+------------------------------------------------------------------------------- --------------------+---------- WhiteCol 0 0 0 1 0 1 2 19 4 41 0.00 0.00 0.00 2.44 0.00 2.44 4.88 46.34 9.76 100.00 -----------+------------------------------------------------------------------------------- --------------------+---------- Example 3: good software doing things badly. logit tenure i.female i.female#c.articles i.male i.male#c.articles, nocons note: 0.male#c.articles omitted because of collinearity note: 1.male#c.articles omitted because of collinearity ------------------------------------------------------------------------------ tenure Coef. Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.female -2.473265.1351561-18.30 0.000-2.738166-2.208364 female# c.articles 0.0980976.0098808 9.93 0.000.0787316.1174636 1.0421485.0098962 4.26 0.000.0227524.0615447 1.male -2.693147.1170916-23.00 0.000-2.922642-2.463651 male# c.articles 0 (omitted) 1 (omitted) ------------------------------------------------------------------------------ DidStataCorpreadtheWFbook? 1.Tacitknowledge 2.Heavylifting 3.Timetopractice 1.Explicitknowledgeisthestuffoftextbooksandarticles. 2.Tacitknowledgeisimplicitandundocumented(MichaelPolanyi). A.Peopleareunawareoftheiressentialtacitknowledge. o HenryBessemer spatentformakingsteeldidn twork(1855) B.Tacitknowledgeistransferred atthebench. o Personalcomputersimpedethetransferoftacitknowledge. Data analysis includes a lot of heavy lifting Thereality,ofcourse,todayisthatifyoucomeupwithagreatideayou don'tgettogoquicklytoasuccessfulproduct.there'salotof undifferentiatedheavyliftingthatstandsbetweenyourideaandthat success. JeffBezos,amazon.com

TheWorkflowofDataAnalysisUsingStata 1.MakestacitknowledgeaboutWFexplicit. 2.Itdealswithalotofundifferentiatedheavylifting. 3.Itcontainsspecificsonthegeneralissuesdiscussedtoday. 4.ThebookfocusesontoolsinStata,buttheprinciplesapplybroadly. ironicaloptimism Theuniversalaptitudeforineptitudemakesanyhumanaccomplishment anincrediblemiracle.dr.johnpaulstapp replication 1.AneffectiveWFfacilitatesreplication. 2.Youmustplanforreplicationatthestartofaproject. 3.Disciplinesareincreasinglyconcernedwithreplicability. o ArticlesinPoliticalScience,Economics,Sociologyandotherfields. 4.Askyourself: o Areyourdofilesandlogfilesreadyforpublicdisplay? o Willtheyproduceexactlythesameresultsasyouhavepublished? 1.Thecurseofdimensionality:10minordecisions,leadsto1,024reasonable waystocreateyourdata. o Wheretotruncateavariable. o TheseedfortheRNgenerator. o Creatingascalewithpartialmissingdata. o Whichcasestokeepforanalysis. o Howtocodeeducation? o Whatvaluestoassignincomegreaterthan$200,000? o Andsoon... Decisions in the path to analysis: the choices that could be made Decisions in the path to analysis: the choices made

2.Documentation:Replicationshouldinvolveretrievingdocumentation,not tryingtorememberwhatyoudid. 3.Changingsoftware:2weeksofsleeplessnightsduetoversionvariation.This isparticularlydifficultwhenthereisanactiveusercommunity. 4.Lostfiles:corrupted,lost,unreadable,obsolete,orambiguousfiles. Criteria o Ifyourprogramisnotcorrect,thennothingelsematters. OliveiraandStewart o Completingworkquicklygivenaccuracyandreplicability. o Tensionbetweenworkingquicklyandworkingcarefully. o Don'trepeatedlyandinconsistentlydecidehowtodothings. o Standardizationmakesiteasiertofindmistakes. o Automatedprocedurespreventmistakesandarefaster. o Drukker'sDictim:Nevertypeanythingthatyoucanobtainfromasaved result.(didtheauthorsofmarginsthinkaboutthis?) o Themorecomplicatedyourproceduresthemorelikelyyouwillmake mistakesorabandonyourplan. o Yourworkflowshouldreflectthewayyou liketowork. o Ifyouignoreyourprocedures,itisnotagoodWF. o Differentprojectsrequiredifferentworkflows. o Collaborationmakesitmoredifficulttohaveaneffective,efficientand replicableworkflow. o Why?And,whycan ttheydoitjustlikeme? o Everyproblemyoucanhaveworkingbyyourselfismultipled.

1. Agreeduponstandards 2. Explicitcoordination 3. Enforcementofstandards 4. Asenseofhumor Steps o Datamustbeaccurate. o Variablesmustbecarefullynamedandlabeled. o Thistakes90%ofthetime,unlessyouhurry. o Estimatemodelsandcreategraphs. o Oftenthesimplestpartoftheworkflow. o Incorporateoutputintoyourpresentation. o Maintaintheprovenanceofresults. o Makeeffectivepresentations. o Backingupandarchiving:preservingthebitsandthecontent. $2,000toget1variablefroman archived file. o Replicationisimpossiblewithoutyourdataanddofiles. o "Today'snoiseistomorrow'sknowledge."DavidClemmer

Tasks The ideal BlauandDuncan(1967)TheAmericanOccupationalStructure o Allanalyseswerespecified9monthsbeforeoutputwasreceived. o Thebookwaswrittenbasedentirelyonthoseanalyses. o Noneofthelaterbookswrittenwithfullaccesstothedatawereasgood. Issues in planning 1.Aplanisaremindertostayontrack,finishtheproject,andpublishresults. Work.Finish.Publish.MichaelFaraday ssigninhislab 2.Alittleplanninggoesalongwayandalmostalwayssavestime. 3.Planningincludes: o Generalgoals,publishingplans,andfirmdeadlines. o Divisionoflaborandaccountability. o Proposalfordataconstruction:names,labels,formats. o Proceduresforhandlingmissingdata. o Anticipatedanalyses. oguidelinesandresponsibilityfordocumentation. oproceduresandscheduleforbackingupandarchivingmaterials. 1.Organizationismovtivatedbytheneedto: o Findthings o Avoidduplication 2.Itrequiresexplicit,consistentdecisionsaboutnamingandstoring things. 3.Organization: o Helpsyouworkfaster o Rewardsconsistencyanduniformity o Organizationiscontagious 1.Youcan'tfindafileandthinkyoudeletedit. 2.Youfindmultipleversionsofafileanddon'tknowwhichiswhich. 3.Youandacolleagueareworkingondifferentversionsofthesamepaper. Youchangedwhatshechangedandnowyouhavethreeversionsofthe paper. 4.Youneedthefinalversionofthepaperthewassubmittedforreview,but youhavetwo(or16)fileswith"final"inthename. o final_report_v16.docx o NSF_science_report20101021.docx 1.Itiseasiertocreateafilethantofindafile. 2.Itiseasiertofindafilethantoknowwhatisinthefile. 3.Withdiskspacesocheap,itistemptingtocreatealotoffiles.

Organizing: a standard directory structure for all projects \WF project \- History \2009-03-06 project directory created \- Hold then delete \- Pre posted \- To clean \Documentation \Posted \Resources \Text \- Versions \Work \- To do Organizing: wfsetupsingle.bat makes it easy REM workflow talk 2 \ wfsetupsingle.bat jsl 2009-07-12 REM directory structure for single person. FOR /F "tokens=2,3,4 delims=/- " %%a in ("%DATE%") do set CDATE=%%c-%%a-%%b md "- History\%cdate% project directory created" md "- Hold then delete " md "- Pre posted " md "- To clean" md "Documentation" md "Posted" md "Resources" md "Text\- Versions\" md "Work\- To do" Forexample,abatchfilemakescreatinguniformdirectorieseasy. capture log close log using wftalk-example, replace text // program: wftalk-example.do // task: // project: // author: jsl \ 2010-07-27 version 11 clear all set linesize 80 local tag "wftalk-example.do jsl 2010-07-27" // #1 // Description of task 1 // #2 // Description of task 2 log close exit Templatesmakethisstructureeasytouse. Anycoloryouwantaslongasitisblack. 1.Long'sLaw:Itisalwaysfastertodocumentittodaythantomorrow. Corollary1:Nobodylikestowritedocumentation. Corollary2:Nobodyregretshavingwrittendocumentation. Haveyoueversaid:"Drat,thisprogramhastoomanycomments." 2.Documentationoccursonmanylevels:logs,metadata,comments,names. 3.Withoutdocumentation,replicationisvirtuallyimpossible,mistakesare morelikely,andworktakeslonger. 4.Themorecodifiedthefieldthegreatertheemphasisondocumentation. A.TheResearchLogbytheAmericanChemicalSociety. B.Lossoftenureforanalteredresearchlog.

1.Doittoday. 2.Checkittomorrowornextweek:italwaysmakessensetoday. 3.Keepupwithdocumentationbytyingittoeventsintheproject. 4.Includefulldatesandnames. Arealexample... 1.Executioninvolvescarryingouttaskswithineachstep. 2.Effectiveexecutionrequirestherighttools. o Software a.texteditor b.filemanager c.statisticalsoftware d.macroprogram(evenifonlytoinserttimestamps) e.wordprocessor o Hardware:display,storage,memory,CPU 3.Planningisprobablymoreimportantthancomputingpower. For example Cornell 1975: the entire computing infrastructure IBM370with240Kmemory Winchesterdriveswith3MBstorage Costofcomputing$1,000,000. Meantimetodegree7.6years. Indiana 2009: a disposable PC Asus1000HEwith2GBmemory FreeAgentwith1TBstorage 10,000timesmore 350,000timesmore... Costofcomputing$400(2,500timesless). Meantimetodegree7.6years. 1.Randomlydivideyourselvesintotwogroups. o Thecomputerscancomputewhenevertheywantto. o Theplannerscanonlycomputefortwosixhoursessionsaweek. 2.Whofinishesfirst?

Principlesforacomputingworkflow 1.Dualworkflow:keepdatamanagementanddataanalysisseparate. 2.Runorder:namefilessothatiftheyarereruninalphabeticalorder,youwill produceexactlythesameresults. 3.Postingprincipleforsharingresults(definedlater) Datamanagement==> <==Dataanalysis Run order and a dual workflow Datamanagement Dataanalysis data01.do stat01a.do data02v2.do stat01b.do data03.do stat01cv2.do data03-1.do data03-2.do stat02a.do data04.do stat02a1.do stat02b.do postingprinciple Thepostingprincipleisdefinedbytworules: 1.Thesharerule:Onlyshareresultsafterthefilesareposted. 2.Thenochangerule:Onceafileisposted,neverchangeit. stat03av2.do stat03b.do stat03c.do stat03c1.do stat03c2v2.do stat03d.do 1.Theyareselfcontained 2.Theyincludeversioncontrol(version 11.1) 3.Theyexcludedirectoryinformation(whichmightchange) 4.Theyexplicitlysetseedsforrandomnumbers 5.Theyrequirethatyouarchiveuserwrittenadofiles Simplyput:Itshouldrunonanothercomputeratalaterdatewithoutchanges. 1.Lotsofthoughtfulcomments 2.Alignment,indentationandspacing 3.Shortlineswithoutwrapping 4.Noambiguousabbreviations: l a l in 1/3

+----------------+ Key ---------------- frequency row percentage +----------------+ Years of education Occupation 3 6 7 8 9 10 11 12 13 Total -----------+---------------------------------------------------------------------------- -----------------------+---------- Menial 0 2 0 0 3 1 3 12 2 31 0.00 6.45 0.00 0.00 9.68 3.23 9.68 38.71 6.45 100.00 -----------+---------------------------------------------------------------------------- -----------------------+---------- BlueCol 1 3 1 7 4 6 5 26 7 69 1.45 4.35 1.45 10.14 5.80 8.70 7.25 37.68 10.14 100.00 -----------+---------------------------------------------------------------------------- -----------------------+---------- Craft 0 3 2 3 2 2 7 39 7 84 0.00 3.57 2.38 3.57 2.38 2.38 1.Muchofdataanalysisinvolvesrepetitivetasks. 2.Repetitioninviteserrors. 3.Automationisfaster,andlesserrorprone. A.macros:wordsthatrepresentstringsoftext. B.loops:multipleexecutionofthesamecommands. C.returnedresults:avoidingtypingthevalueofanystatisticalresult. D.matrices:holdandsummarizekeyresults. E.adofiles:writeprogramsthatdowhatyouwant. F.me.hlp:don tkeeplookingupthesamethings.forexample, help me InStata,type: findit snag snagcollectsdozensorhundredsofresultstomakethemeasiertodigest. o Thestandardoutputisusedtoverifytheresults. o The snagged summaryletsyoudiscoverwhatyouwant. o Anyoneusingmarginsknowswhythisisnecessary. Example:ownsexandownsexucausedweeksofconfusion.

Cleaning 1a: finding an error with a graph Cleaning 1b: reversing the graph Cleaning 2: remembering a coding decision Cleaning 3: understanding the substantive process Cleaning 4: avoiding expensive mistakes

mlogit (N=337): Factor Change in the Odds of occ Variable: white (sd=.27642268) Odds comparing Alternative 1 to Alternative 2 b z P> z e^b e^bstdx ------------------+--------------------------------------------- Menial -BlueCol -1.23650-1.707 0.088 0.2904 0.7105 Menial -Craft -0.47234-0.782 0.434 0.6235 0.8776 Menial -WhiteCol -1.57139-1.741 0.082 0.2078 0.6477 Menial -Prof -1.77431-2.350 0.019 0.1696 0.6123 BlueCol -Menial 1.23650 1.707 0.088 3.4436 1.4075 BlueCol -Craft 0.76416 1.208 0.227 2.1472 1.2352 BlueCol -WhiteCol -0.33488-0.359 0.720 0.7154 0.9116 BlueCol -Prof -0.53780-0.673 0.501 0.5840 0.8619 Craft -Menial 0.47234 0.782 0.434 1.6037 1.1395 Craft -BlueCol -0.76416-1.208 0.227 0.4657 0.8096 Craft -WhiteCol -1.09904-1.343 0.179 0.3332 0.7380 Craft -Prof -1.30196-2.011 0.044 0.2720 0.6978 WhiteCol-Menial 1.57139 1.741 0.082 4.8133 1.5440 WhiteCol-BlueCol 0.33488 0.359 0.720 1.3978 1.0970 WhiteCol-Craft 1.09904 1.343 0.179 3.0013 1.3550 WhiteCol-Prof -0.20292-0.233 0.815 0.8163 0.9455 Prof -Menial 1.77431 2.350 0.019 5.8962 1.6331 Prof -BlueCol 0.53780 0.673 0.501 1.7122 1.1603 Prof -Craft 1.30196 2.011 0.044 3.6765 1.4332 Prof -WhiteCol 0.20292 0.233 0.815 1.2250 1.0577 ---------------------------------------------------------------- Variable: ed (sd=2.9464271) Odds comparing Alternative 1 to Alternative 2 b z P> z e^b e^bstdx ------------------+--------------------------------------------- Menial -BlueCol 0.09942 0.972 0.331 1.1045 1.3404 Menial -Craft -0.09382-0.962 0.336 0.9105 0.7585 Menial -WhiteCol -0.35316-3.011 0.003 0.7025 0.3533 Menial -Prof -0.77885-6.795 0.000 0.4589 0.1008 BlueCol -Menial -0.09942-0.972 0.331 0.9054 0.7461 BlueCol -Craft -0.19324-2.494 0.013 0.8243 0.5659 BlueCol -WhiteCol -0.45258-4.425 0.000 0.6360 0.2636 BlueCol -Prof -0.87828-8.735 0.000 0.4155 0.0752 Craft -Menial 0.09382 0.962 0.336 1.0984 1.3184 Craft -BlueCol 0.19324 2.494 0.013 1.2132 1.7671 Craft -WhiteCol -0.25934-2.773 0.006 0.7716 0.4657 Craft -Prof -0.68504-7.671 0.000 0.5041 0.1329 WhiteCol-Menial 0.35316 3.011 0.003 1.4236 2.8308 WhiteCol-BlueCol 0.45258 4.425 0.000 1.5724 3.7943 WhiteCol-Craft 0.25934 2.773 0.006 1.2961 2.1471 WhiteCol-Prof -0.42569-4.616 0.000 0.6533 0.2853 Prof -Menial 0.77885 6.795 0.000 2.1790 9.9228 Prof -BlueCol 0.87828 8.735 0.000 2.4067 13.3002 Prof -Craft 0.68504 7.671 0.000 1.9838 7.5264 Prof -WhiteCol 0.42569 4.616 0.000 1.5307 3.5053 ---------------------------------------------------------------- Variable: exper (sd=13.959364) 1. Takelotsofclassesinstatistics. 2. Findexemplars;don trediscoverthewheel;don tdoit yourway. 1.Contentandmethodsaresubstantive,disciplinarydecisions. 2.Presentationsandpreservationofprovenanceareuniversal. ThecircledtextcontainsresultsImayneedtoconfirmlater: Turningon"show/hide "revealstheprovenance: twoway (line art_root2 art_root3 art_root4 art_root5 articles, /// lwidth(medium)), ytitle(number of Publications to the k-th Root) /// yscale(range(0 8.)) legend(pos(11) rows(4) ring(0)) /// caption(wf7-caption.do \ jsl 2008-04-09, size(vsmall))

Whenitcomestosavingyourwork,expectthingstogowrong,expectthatyou willdeletethewrongfileattheworstpossibletime,andexpectahosetobeleft onintheroomaboveyourcomputer.ifyouexpecttheworst,youmightbeable topreventit. 1.KennedyassassinationonNovember22,1963andthe9/11survey. 2.508KvolumesinobsoleteformatsatBritishMuseum.2MvideosatIU. 3.NeilArmstrong'swalkonthemoononJuly20,1969,thelostmoontapes, andpinkfloyd'sdarksideofthemoon. "afuzzygrayblobwadingthroughaninkwell" DarkSideoftheMoon

1. Installtheprogam 2. Dropfilesintothefolder 3. RetrievethemfromanymachinewithDropbox 4. Havesharedfoldersforcollaboration 5. Avoidsendingattachmentsevenforonetimefileexchanges Dropboxandsimilarservices,enterprisemassstorage,localservers. 1.Sizeperdriveincreasedbyafactorofmorethan300,000. 2.Costpergigabytedecreasedbyafactorof7,000,000. 3.AshoeboxfullofportabledrivescanholdenoughIBMcardstofilla30M cubicfootbuilding;60mcubicfeetnextmonth.withcompression 1.Slowly,systematically,throughtfully. 2.Finishthelast5%ofthechange. 3.LikePennandTeller,masterafewcooltricks. 4.Don'tdoitunderdeadline. 1.Therearemanyviableworkflows. 2.ThekeyadvantageoftheWFbookisthatitiswrittendown. 3.AlanAcockwrote: o Noteveryonewillagreewithallof[Long's]suggestions. o IwillposttheannouncementofWorkflowonmydoorwiththefollowing note: Iamgladtohelpanybodywhofollowedatleast25%oftheadvice Longprovides andbringsmetheirdofiles! 4.DoyoureallywanttospendyourtimerediscoveringthemistakesImade?