Day 16: The Ten Table Commandments
3 min readMay 23, 2020
The database is the modern day lab notebook. Tables in any coding environment, whether it’s MATLAB or Python, come with a steep learning curve. Through this tutorial I’ll introduce some of the best practices that have helped make me an efficient data analyst in my science lab.
Below are some rules you should try to follow. While these are written with MATLAB users in mind, they are generally applicable to storing data in tables.
- If you are choosing between making two tables or one, you should make one. You might think it’s a good idea to have two separate tables: control_table and mutant_table. Instead, put everything into one table and create a ‘treatment’ column which contains values ‘control’ and ‘mutant’.
- Make your table as readable as possible. Suppose each row of your data is either ‘Drug’ or ‘No drug’. You could create a column that says ‘Drug’ and have entries 1 or 0, but it’ll be much easier to future users of your table if you create a column that says ‘Treatment’ and then have ‘Drug’ or ‘No drug’ as entries in that column.
- Use categorical variables. Very few MATLAB users are familiar with this data type, but to unlock the features of a table you’ll want to use categorical variables. So if you plan to create a column from a cell with entries {‘red’, ’green’, ’red’, ’blue’}, then first convert it into a categorical variable by using the categorical function.
- Avoid storing the same (or highly similar) data in multiple places. This is generally a good rule for all data analysts. Avoid making columns that will contain every transformed or filtered version of your data. Instead, use the table to store parameters you can use to quickly transform or filter a column.
- Every time you create a table, specify the variable names in the same command. This can look tedious but it will save you many headaches later on and force you to plan ahead, create efficient and readable tables.
- “I’m planning to make a table today, when should I think about how I will use it?” Yesterday. By this, I mean you should plan ahead as much as possible what your data will say and then create a table that allows you to say it. This advice is in line with a recent eLife editorial “Tell Me a Story”, you should think about what insights your data offers, then arrange the data tables and figures.
- Use batch operations (like rowfun in MATLAB) whenever you can. When your data is in a table, there is no need to write for loops to analyze each row, instead use the rowfun function.
- Don’t just view a summary of your table, store it. I’ve seen people use summary functions to view a summary of their table, BUT they will also write custom scripts functions to store summary statistics in variables for later inspection. In MATLAB, this is unnecessary because you can save the summary as a structure: s = summary( myTable ). You can then access each field of the table and its statistics using s.column_name.Min, s.column_name.Median, etc.
- Tables open up a world of statistical modeling and analysis tools — use them! Many programming environments (MATLAB, seaborn, etc) offer statistical analysis tools that are fully ready for your table as input. For some examples in MATLAB, check out the documentation for scatter, boxplot, fitglm, anova1, and others!
- Get comfortable with BOTH logical and array indexing of tables. There are two ways to obtain rows in a table, for example you could select those rows for which mytable(Treatment==’drug’ & Age>20,:) or you could select them using mytable( find(Treatment==’drug’ & Age>20), : ). In the first case, you are indexing your table using an array of 0’s and 1’s. In the second case, you are indexing with an array of specific row numbers. With this simple example, it’s obvious that the logical is the more efficient route, but sometimes you may want to look at a specific row (first row, last row, a row with a weird outlier) — in those cases array indexing is the way to go.