At EISM, we work with our clients to create and productionize data science using one easy and intuitive environment, enabling every stakeholder in the data science process to focus on what they do best and producing world-class business results. This includes applying deep learning to tabular data, which is also referred to as structured data.
The term tabular or structured data refers to data that resides in a fixed field within a file or record. Structured data is typically stored in a relational database management system (RDBMS). It can consist of numbers and text, and sourcing can happen automatically or manually, as long as it’s within an RDBMS structure. Structured data depends on a data model that defines which types of data to include and how to store and process them. For present purposes, we will quickly review two topics that matter most when applying deep learning to structured data: categorical variables and SQL.
Categorical Variables (Embeddings)
The key to getting the most out of deep learning for tabular data is to use embeddings for categorical variables. This allows the model to capture relationships between categories. For example, maybe weekends have similar data, and maybe Friday’s data is somewhere in between a weekend day and the other days of the week. Another example is zip codes. There may be patterns for zip codes that are geographically close to each other, and for zip codes that are of similar socio-economic status.
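To make the idea concrete, here is a minimal sketch of an embedding lookup for a day-of-week column, using only the Python standard library. The names (`embedding_table`, `embed`, `encode_row`) and the embedding size of 4 are our own assumptions for illustration. In a real project the table would live inside a deep learning framework and its values would be learned during training rather than left at their random starting point.

```python
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
EMBEDDING_DIM = 4  # hypothetical size; in practice a tunable hyperparameter

# Map each category to an integer index.
day_to_index = {day: i for i, day in enumerate(DAYS)}

# One vector per category, randomly initialised.
# In a real model these values would be updated by gradient descent.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for _ in DAYS
]


def embed(day):
    """Look up the embedding vector for a day-of-week category."""
    return embedding_table[day_to_index[day]]


def encode_row(day, numeric_features):
    """Concatenate the day embedding with the row's numeric features."""
    return embed(day) + list(numeric_features)


# A row with day "Fri" and two numeric columns becomes a 6-value model input.
row_input = encode_row("Fri", [12.5, 3.0])
print(len(row_input))
```

After training, days that behave alike (for example Saturday and Sunday) would be expected to end up with similar vectors, which is what lets the model share what it learns across related categories.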
One way to capture these multi-dimensional relationships between categories is to use something called “embeddings”. This is the same concept that is used with word embeddings, such as Word2Vec. A 3-dimensional version of a word embedding might look like:
Notice in the graph above that the first dimension is capturing something related to being a dog, while the second dimension captures the age of dogs and cats (puppies and kittens). This example is very simple and can be thought up conceptually by hand, but in the real world of our jobs we can use machine learning to find the best representations.
With representations learned automatically by machine learning, our work is made a lot easier and we’re able to extract value from relational or structured data, which is one of the most common forms in which data is found. To get a better visualization, see the graph above, where all of the animals are characterized by similar vectors, whereas an “avalanche”, which is an entirely different concept, is characterized by a very distant vector.
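The notion of "similar" and "distant" vectors can be checked numerically with cosine similarity. The sketch below uses hand-picked, hypothetical 3-dimensional vectors in the spirit of the dog and cat example: the first dimension stands for dog-ness and the second for youth, as described above, while the third ("animal-ness") is our own assumption added to place "avalanche" far away. Learned embeddings would not have such neatly interpretable dimensions.

```python
import math

# Hypothetical vectors: (dog-ness, youth, animal-ness).
embeddings = {
    "dog":       [0.9, 0.1, 0.9],
    "puppy":     [0.9, 0.9, 0.9],
    "cat":       [0.1, 0.1, 0.9],
    "kitten":    [0.1, 0.9, 0.9],
    "avalanche": [0.0, 0.0, -0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Compare "dog" against every other concept.
for word in ["puppy", "cat", "kitten", "avalanche"]:
    score = cosine_similarity(embeddings["dog"], embeddings[word])
    print(f"dog vs {word}: {score:.3f}")
```

With these made-up values, the animal pairs score well above zero while "dog" versus "avalanche" comes out negative, which mirrors the separation shown in the graph.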
SQL, or Structured Query Language, is the standard language for communicating with relational databases. SQL databases allow users to avoid writing custom code to interact with data sets, and instead use a well-defined and standardized language that carries over, with minor dialect differences, between many different database systems.
Despite its many years of dominance, SQL has been fading as the de facto choice for software applications that need to work with large datasets at low latencies. In the age of big data, developers need to be mindful of the limitations of SQL, particularly when processing large volumes of information. This matters all the more now that companies are collecting more data than ever before and struggling to derive meaningful insight from it.
Still, many companies and people continue to work with SQL. The good news is that, contrary to a common misperception, many of the most recent advances in deep learning can be applied to data in an SQL database.
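As a rough illustration of how SQL data can feed a deep learning model, the sketch below pulls rows from a database and converts the categorical columns into integer indices, which is the form an embedding layer expects. It uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and the sample rows are invented for this example, and `build_vocab` is a hypothetical helper, not part of any library.

```python
import sqlite3

# Build a small in-memory table to stand in for a real SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, zip_code TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Fri", "94110", 120.0),
        ("Sat", "94110", 310.5),
        ("Sat", "10001", 95.0),
        ("Mon", "10001", 40.0),
    ],
)

# Fetch the rows in insertion order with a plain SQL query.
rows = conn.execute(
    "SELECT day, zip_code, amount FROM sales ORDER BY rowid"
).fetchall()
conn.close()


def build_vocab(values):
    """Assign each distinct category a stable integer index."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


day_vocab = build_vocab(row[0] for row in rows)
zip_vocab = build_vocab(row[1] for row in rows)

# Each row becomes (day index, zip index, numeric amount).
encoded = [(day_vocab[d], zip_vocab[z], amount) for d, z, amount in rows]
print(encoded)
```

The integer indices can then be passed to embedding layers, one per categorical column, while numeric columns such as `amount` go into the network directly (usually after scaling).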
If you work with data in tabular or structured form, whether in Excel or with SQL, and are interested in applying deep learning to it as part of an end-to-end data science process, please contact us. We can quickly assess your situation and give you some tips or help. One of our salespersons is available to talk to you today. Click here to set up a Zoom meeting or feel free to contact us via email or phone.