When it comes to language learning, focusing on phrases and sentences rather than isolated words can make a significant difference. While memorizing vocabulary lists might seem like a straightforward approach, it often leaves learners struggling to use those words in real-life situations. Words alone rarely convey complete meaning; context is crucial. By learning phrases and sentences, you naturally absorb grammar, word order, and common expressions, making your speech sound more natural and fluent.
For example, knowing the word “book” is helpful, but learning the phrase “I’d like to book a table” is far more practical. Phrases provide ready-made building blocks for conversation, reducing the mental effort needed to construct sentences from scratch. This approach also helps with pronunciation and intonation, as you practice speaking in chunks rather than isolated syllables.
Moreover, sentences and phrases expose you to cultural nuances and idiomatic expressions that single words cannot convey. This leads to better comprehension when listening or reading, and more confidence when speaking. In summary, prioritizing phrases and sentences accelerates your ability to communicate effectively, making language learning more enjoyable and efficient.
Below are some Anki decks that can be used:
Deutsch:
German Sentences
Part 1 - A1 and A2: https://ankiweb.net/shared/info/785874566
Part 2 - B1 : https://ankiweb.net/shared/info/17323417
Part 3 - B2-C1 : https://ankiweb.net/shared/info/944971572
German 7000 Intermediate/Advanced Sentences w/ Audio
Part 1 : https://ankiweb.net/shared/info/1125602705
🏥 Health (Gesundheit)

| Compound Noun | Meaning |
| --- | --- |
| Krankenhaus | hospital |
| Zahnarzt | dentist |
| Augenarzt | eye doctor |
| Kopfschmerzen | headache |
| Rückenschmerzen | back pain |
| Körperpflege | body care |
| Krankenkasse | health insurance |
| Herzschlag | heartbeat |
| Blutdruck | blood pressure |
| Hausarzt | family doctor |
| Notaufnahme | emergency room |
| Krankenwagen | ambulance |
🎓 School & Learning (Schule und Lernen)
| Compound Noun | Meaning |
| --- | --- |
| Schulbuch | school book |
| Lehrerzimmer | teacher's room |
| Sprachschule | language school |
| Hausaufgabe | homework |
| Klassenarbeit | class test |
| Stundenplan | schedule/timetable |
| Schultasche | school bag |
| Schulweg | way to school |
| Schülerausweis | student ID card |
| Schulzeit | school time |
| Unterrichtsstunde | lesson |
| Schulanfang | beginning of school |
German/Deutsch is divided into different cases, which makes it easier to build up the language slowly but surely. Each case adds pieces needed for fluency, or at least for being able to form some sentences. The cases are as follows:
Nominativ
Akkusativ
Dativ
Genitiv (A2)
Nominativ

| Artikel | Maskulin | Feminin | Neutrum | Plural |
| --- | --- | --- | --- | --- |
| bestimmte Artikel (the) | der | die | das | die |
| unbestimmte Artikel (a/one) | ein | eine | ein | – |
| negative Artikel (no) | kein | keine | kein | keine |
| Possessive Artikel (my) | mein | meine | mein | meine |
Beispiele (Examples):
Was ist das? Wer ist das?
Das ist der Tisch.
unbestimmte Artikel: Das ist ein Tisch.
negative Artikel: Das ist kein Tisch.
Possessive Artikel: Das ist mein Tisch.
Das ist die Banane. 🍌
unbestimmte Artikel: Das ist eine Banane.
negative Artikel: Das ist keine Banane.
Possessive Artikel: Das ist meine Banane.
Das ist das Handy. 📲🤳
unbestimmte Artikel: Das ist ein Handy.
negative Artikel: Das ist kein Handy.
Possessive Artikel: Das ist mein Handy.
Das sind die Zeitungen. 🗞️📰
unbestimmte Artikel: – (this is plural, so there is no indefinite article)
negative Artikel: Das sind keine Zeitungen.
Possessive Artikel: Das sind meine Zeitungen.
Das sind die Bücher.
unbestimmte Artikel: –
negative Artikel: Das sind keine Bücher.
Possessive Artikel: Das sind meine Bücher.
Das ist der Bus / das Auto.
unbestimmte Artikel: Das ist ein Auto.
negative Artikel: Das ist kein Auto.
Possessive Artikel: Das ist mein Auto.
Das sind die Blumen.
unbestimmte Artikel: –
negative Artikel: Das sind keine Blumen.
Possessive Artikel: Das sind meine Blumen.
Das ist der Lehrer / die Lehrerin.
unbestimmte Artikel: Das ist ein Lehrer.
negative Artikel: Das ist kein Lehrer.
Possessive Artikel: Das ist mein Lehrer.
Das ist die Katze.
unbestimmte Artikel: Das ist eine Katze.
negative Artikel: Das ist keine Katze.
Possessive Artikel: Das ist meine Katze.
Das ist der Kugelschreiber.
unbestimmte Artikel: Das ist ein Kugelschreiber.
negative Artikel: Das ist kein Kugelschreiber.
Possessive Artikel: Das ist mein Kugelschreiber.
Das ist die Schokolade.
unbestimmte Artikel: Das ist eine Schokolade.
negative Artikel: Das ist keine Schokolade.
Possessive Artikel: Das ist meine Schokolade.
Das ist das Mädchen.
unbestimmte Artikel: Das ist ein Mädchen.
negative Artikel: Das ist kein Mädchen.
Possessive Artikel: Das ist mein Mädchen.
Das ist der Elefant.
unbestimmte Artikel: Das ist ein Elefant.
negative Artikel: Das ist kein Elefant.
Possessive Artikel: Das ist mein Elefant.
Note:
To ask someone what the article of a Nomen (noun) is, we use the following sentence:
Was ist der Artikel von ____?
To ask someone what a particular word means, or to show someone an object and ask what it is called, we use the following sentence.
A noun is called a Nomen in German, and it is always written with its first letter capitalized. This might not be intuitive for an English speaker, but it is the rule. Also, in most cases a Nomen is written with its Artikel. The Artikel depends on the case: Nominativ, Akkusativ, or Dativ. And in German, based on those Artikel, we have different words for saying "no", "a/one", and "my". The Artikel im Nominativ are explained above.
A dynamic array should:
be able to change its size dynamically,
be able to add/delete elements fast, and
be able to insert/delete an element in the middle.
We need to make this as efficient as possible, so let's consider what we would have done if we had to invent it ourselves.
First, we take its functionality and try to simplify it as much as possible.
Here, let's take only the 'adding dynamically' part.
Adding Dynamically
Say we have a fixed array of 4 elements. How can we make it so that we can add a fifth element to it?
Here, we know that we need to decide on a fixed amount of space for our task beforehand in order to use memory (refer to how memory works).
Alternative #1: Make an array of 5 elements, then copy all the data to the new array.
Using this, we can make a dynamic array. However, it is very expensive to do this for a huge amount of data.
For example, for an array of length 1M, we need to perform on the order of $N^2/2 \approx 5 \times 10^{11}$ (about 500 billion) copy operations.
Let's assume we are continuously adding elements to the array. For the 5th element we need 5 copy operations. For the 6th element, we first create a new array of size 6 and copy the 6 elements, so our total so far is 5 + 6 operations. For the 7th element it is 5 + 6 + 7, and so on. In big-O notation this is $O(N^2)$.
Alternative #2: Make the new array the size of the old array + 8 (say).
This reduces the number of copy operations a lot; however, since we still copy the whole array every 8 insertions, the total work is still of $O(N^2)$ complexity.
Alternative #3: Make the new array double the size of the old array.
Here, the total number of copy operations needed to grow an array to size N is at most about 2N, so the overall complexity is $O(N)$ (constant amortized cost per insertion).
This is a very cool problem, so if you are math savvy, take out a piece of paper and do the math; it is quite fun to think about. Work out why this is $O(N)$.
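For reference, here is one way to see it: with doubling, a copy of the whole array happens only when the capacity is exhausted, i.e. at sizes $1, 2, 4, 8, \dots$ So the total number of copies made while growing to $N$ elements is

$$1 + 2 + 4 + \dots + \frac{N}{2} + N \;=\; \sum_{k=0}^{\log_2 N} 2^k \;=\; 2N - 1,$$

which is $O(N)$ overall, or amortized $O(1)$ per append.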
This is how programming languages implement dynamic arrays.
For deletion: if the number of filled elements drops below half of the capacity, we shrink the array by half (in practice, implementations often wait until it is only about a quarter full, to avoid repeatedly growing and shrinking near the boundary). This means our memory usage is also kept proportional to the data.
Similarly, in the same way we can support insertions and deletions from the middle or front of the array, although these still require shifting the later elements, so each such operation costs $O(N)$.
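As a concrete illustration, here is a minimal Python sketch of such a dynamic array. The growth and shrink factors are the ones discussed above, not any particular language's implementation, and a Python list stands in for the fixed-size backing store:

```python
class DynamicArray:
    """Minimal dynamic array: grow by doubling, shrink by halving."""

    def __init__(self):
        self._capacity = 4                      # starting fixed-size backing store
        self._size = 0                          # number of slots actually used
        self._data = [None] * self._capacity

    def _resize(self, new_capacity):
        # Alternative #3: allocate a new fixed array and copy everything over.
        new_data = [None] * new_capacity
        for i in range(self._size):
            new_data[i] = self._data[i]
        self._data = new_data
        self._capacity = new_capacity

    def append(self, value):
        if self._size == self._capacity:
            self._resize(2 * self._capacity)    # double when full -> amortized O(1) append
        self._data[self._size] = value
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty array")
        self._size -= 1
        value = self._data[self._size]
        self._data[self._size] = None
        # Shrink when mostly empty so memory stays proportional to the contents.
        if 0 < self._size <= self._capacity // 4:
            self._resize(self._capacity // 2)
        return value

    def __len__(self):
        return self._size

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._data[i]
```

Appending 100 elements to this structure copies at most about 2 × 100 elements in total, which is the amortized $O(N)$ behaviour described above.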
According to the author, ML is a collection of tools and techniques that transforms data into decisions.
Basically, ML is about 2 things:
Classifying things (Classification) and
Quantifying Predictions (Regression).
Comparing ML Methods:
To choose which method to use for your application, we can compare the predictions of each method/model with the actual outcomes. This is called evaluating a model, and the metrics used are called evaluation metrics.
For this, we first fit the model to the training data.
Then we make predictions with the trained model.
Then we evaluate the predictions made on the test set against the actual outcomes.
We can do this for different models/methods and, based on the evaluation metrics, select a suitable method for our application.
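For example, a minimal scikit-learn sketch of this fit/predict/evaluate loop (the dataset and the two candidate models are arbitrary placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit each candidate model on the train set, predict on the test set,
# and compare them using the same evaluation metric (accuracy here).
for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(type(model).__name__, accuracy_score(y_test, preds))
```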
Here, just because a machine learning method fits the training data well doesn't mean it will perform well on the testing data.
Fits the train data well but makes poor predictions = overfitting.
Doesn't fit the train data well = underfitting.
Independent and Dependent Variables
variable: a value that varies from record to record.
Say we have two variables, 'height' and 'weight', and say that height is predicted from a person's weight. Then 'height' is the dependent variable and 'weight' is an independent variable, since it is the variable used to predict the dependent one.
Here, the independent variables are also called features.
Discrete and Continuous Data
discrete data: countable values; it only takes specific values.
continuous data: measurable values; it can take any value within a given range.
Chapter 2: Cross Validation
From Chapter 1 we learned that we train the model on the 'train set' and evaluate the model on the 'test set'.
But how do we decide which data points to put in the 'test set' and which in the 'train set'?
The answer is cross validation.
Say you have 10 data points and have chosen an 80/20 train-test split. This means we randomly assign 8 points to the train set and the remaining 2 data points to the test set. The 2 points chosen here will not be reused in the test set of the next round: for the second round we choose another 2 data points for the test set and the remainder for the train set. We can do this 5 times, since we have 10 data points in total and an 80/20 train-test split.
Therefore, cross validation is a way of solving the problem of not knowing which points are best for testing by using all of them, one group at a time.
You can also think of it as making 5 groups and each time using one group as the 'test set' and the remaining groups as the 'train set'.
The number of iterations/groups is also called the number of folds, so this is an example of 5-fold cross validation.
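A quick sketch of 5-fold cross validation with scikit-learn (the data and model here are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=10, n_features=4, random_state=0)

# 5 folds of 10 points: each fold uses 2 points for testing and 8 for training,
# and every point ends up in the test set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(scores, scores.mean())
```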
But why can't we use all the data as the 'train set'?
Because the only way to determine whether a model has overfit is to evaluate it on new data.
Reusing the same data points for training and testing is called data leakage.
The main advantage of cross-validation is that it is a proper measure of how well a model performs, instead of relying on the luck of a single train-test split: if the test set happens to be easy, the model will look better than it actually is.
When we have a lot of data, 10-Fold Cross Validation is commonly used.
Another commonly used form of cross validation is leave-one-out:
use all but one point for training, and the remaining point for testing.
iterate until every single point has been tested.
we usually use this for small datasets.
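Continuing the sketch above, leave-one-out just swaps the splitter (reusing the placeholder X and y from the previous snippet):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# One fold per data point: train on 9 of the 10 points, test on the held-out one.
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of held-out points predicted correctly
```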
Sometimes one model performs better in some iterations and another model performs better in others. In such cases we use statistics to decide which model is better.
Chapter 3: Fundamental Concepts in Statistics!!!
Main Idea of Statistics:
Statistics provide us a set of tools to quantify the variation that we find in everyday life.
For example, the number of fries you get in a bucket is not always the same. But say that we track it. Then, using statistics, we can predict how many fries we will get tomorrow, and we can also determine how confident we can be in that prediction.
Here, say that you predict a positive result but are not confident in it; then you would look for an alternative approach.
We know that to make a prediction, we need to understand the trend of the data.
And a histogram is a good way of visualizing the trend of the data:
divide the range into a number of bins,
and stack the elements based on how many fall into each bin.
Here, the main question to think about when making a histogram is how many bins you should use.
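For example, a minimal NumPy sketch (the data and bin count are arbitrary):

```python
import numpy as np

# e.g. the number of fries counted in each bucket over 30 days
counts = np.random.default_rng(0).normal(loc=50, scale=5, size=30)

# Divide the range into 6 bins and count how many observations fall into each.
frequencies, bin_edges = np.histogram(counts, bins=6)
print(frequencies)  # height of each bar
print(bin_edges)    # boundaries of the bins
```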
A Naive Bayes algorithm makes predictions using histograms.
Calculating probability:
The probability of something occurring is the number of times it occurred divided by the total number of observations made.
Here, the more observations we have, the more confident we can be in our predictions.
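A tiny worked example of that ratio (the numbers are made up):

```python
# Out of 100 buckets of fries we observed, 7 contained fewer fries than advertised.
observations = 100
occurrences = 7

probability = occurrences / observations
print(probability)  # 0.07 -- with more observations this estimate becomes more trustworthy
```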
But we know that collecting more samples is expensive, both in money and in time.
We can address this problem using a probability distribution.
To improve the retrieval of a RAG system, you usually need to fine-tune the pre-trained embedding LLM used for retrieval. However, it can also be done by using Instructor models: https://huggingface.co/hkunlp/instructor-xl
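For reference, a minimal retrieval sketch with an Instructor model, assuming the `InstructorEmbedding` package shown on the linked model card (the instruction strings and documents below are made up):

```python
import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

# Instructor models take an (instruction, text) pair, so retrieval behaviour
# can be steered with the instruction instead of fine-tuning the model.
docs = ["RAG combines retrieval with generation.", "Bananas are rich in potassium."]
doc_emb = model.encode([["Represent the document for retrieval:", d] for d in docs])
query_emb = model.encode([["Represent the question for retrieving supporting documents:",
                           "How does retrieval-augmented generation work?"]])

# Rank documents against the query by cosine similarity.
scores = (doc_emb @ query_emb.T).ravel() / (
    np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb))
print(docs[int(scores.argmax())])
```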
Unsupervised document classification for imbalanced data sets poses a major challenge as it requires carefully curated dataset.
The authors propose an integration of web scraping, one-class Support Vector Machines (SVM), and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling.
Topic modeling with methods like Latent Dirichlet Allocation (LDA, the most commonly used model) is satisfactory for long texts, but for short texts (like tweets) it is challenging.
The authors compared the performance of LDA, the Gibbs Sampler Dirichlet Multinomial Model (GSDMM), and the Gamma Poisson Mixture Model (GPM), the latter two of which are specifically designed for sparse data.
They found that GSDMM and GPM are better for sparse data.
Notes
EMBEDDING → DIMENSIONALITY REDUCTION → CLUSTERING → TOP WORDS (I-TF-IDF and COSINE SIMILARITY) → LLM (GPT-4 and GPT-3.5) for topic naming and description
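A rough sketch of that pipeline is below. The embedding model, the UMAP/HDBSCAN parameters, the example corpus, and the plain per-cluster TF-IDF step are my assumptions, not the paper's exact setup, and the final LLM naming step is only indicated as a comment:

```python
import numpy as np
import umap, hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus (swap in your own short texts/tweets).
docs = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes")).data[:500]

# 1) EMBEDDING
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 2) DIMENSIONALITY REDUCTION
reduced = umap.UMAP(n_components=5, random_state=0).fit_transform(embeddings)

# 3) CLUSTERING
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)

# 4) TOP WORDS per cluster (plain TF-IDF over each cluster's concatenated text,
#    standing in for the class-based TF-IDF variant used in the paper)
clusters = [c for c in sorted(set(labels)) if c != -1]
cluster_docs = [" ".join(d for d, l in zip(docs, labels) if l == c) for c in clusters]
vec = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vec.fit_transform(cluster_docs).toarray()
terms = np.array(vec.get_feature_names_out())
for c, row in zip(clusters, tfidf):
    print(f"cluster {c}:", ", ".join(terms[row.argsort()[::-1][:5]]))

# 5) LLM naming: send each cluster's top words (plus a few sample documents)
#    to GPT-4 / GPT-3.5 and ask for a short topic name and description.
```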
Answering questions using LLM (ChatGPT + RAG Implementation)
My initial thoughts and what I would like to get out of this
What makes tabular data difficult to learn for deep-learning algorithms, or even plain neural networks?
What is the correct way for benchmarking?
Notes on the paper
There are benchmarks for deep learning; however, there are not many for tabular data.
The superiority of GBTs over NNs is explained by specific features of tabular data:
irregular patterns in the target function,
uninformative features, and
non-rotationally-invariant data, where linear combinations of features misrepresent the information.
the paper defines
a standard set of 45 datasets from varied domains with clear characteristics of tabular data and
a benchmarking methodology accounting for
fitting models and
finding good hyperparameters.
results show that tree-based models remain SOTA on medium-sized data (~10K samples)
Inductive biases that lead to better performance for tree-based models for tabular data
NNs are biased to overly smooth solutions
To test the effectiveness of learning smooth functions, the authors smoothed the train set targets using a Gaussian kernel.
Smoothing degrades the performance of the decision trees but does not affect the performance of the NNs.
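A sketch of what "smoothing the train set" means here, written as a minimal Nadaraya-Watson-style smoother of my own (not the paper's code; the lengthscale and toy data are arbitrary):

```python
import numpy as np

def gaussian_kernel_smooth(X, y, lengthscale=0.5):
    """Replace each training target with a Gaussian-weighted average of all targets."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dists / (2 * lengthscale**2))
    return (weights @ y) / weights.sum(axis=1)

# Toy irregular target: smoothing removes its high-frequency component.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(5 * X[:, 0]) + 0.3 * rng.normal(size=200)
y_smooth = gaussian_kernel_smooth(X, y)

# Training on (X, y_smooth) vs. (X, y) reproduces the comparison: per the paper's
# finding, trees lose accuracy on the smoothed targets while NNs are barely affected.
```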
NNs are biased towards low-frequency functions. What does this mean?
regularization and careful optimization do help NNs learn irregular patterns/functions.
Periodic embeddings or PLE might help learn the high-frequency part of the target function.
This also explains why the ExU activation is better for tabular deep learning.
Now the question is: why do NNs fail to learn irregular patterns? And why does PLE help NNs learn better? Does it make the train set more regular?
Uninformative features affect MLP-like NNs more.
MLP-like structures have a harder time filtering out uninformative features compared to GBTs.
What does uninformative features mean?
Features that do not provide meaningful or useful information to help make predictions or gain insights from the data.
For GBTs, even if we remove half of the features (informative as well as uninformative), the performance does not degrade much.
However, for NNs (ResNet, FT-Transformer), removing features negatively affects the performance of the model.
Therefore, they are less robust to uninformative features.
MLP like structure’s inherent rotational invariance prevents it from easily isolating and ignoring uninformative features when they are mixed through linear transformations.
However, the weak learners in GBTs recursively partition the feature space by making splits based on individual feature values. Therefore they are not rotationally invariant and can easily filter out uninformative features.
Here, the FT-Transformer requires an embedding layer due to its use of the attention mechanism. This embedding transports the features into a different embedding space, breaking the rotational-invariance bias of an MLP-like architecture.
Data are not invariant by rotation, so learning procedures should not be either.
Why are MLPs much more hindered by uninformative features, compared to other models?
Random rotation does not lead to much difference in the performance of ResNet and leads to a slight degradation for FT-Transformer, but it hugely disrupts the performance of GBTs.
This suggests that rotation invariance is not desirable, similarly to vision [Krizhevsky et al., 2012].
We note that a promising avenue for further research would be to find other ways to break rotation invariance which might be less computationally costly than embeddings.
Challenges to build tabular-specific NNs as per the authors
be robust to uninformative features,
preserve the orientation of the data, and
be able to easily learn irregular functions.
Deep Learning architectures have been crafted to create inductive bias matching invariance and spatial dependencies of data.
This means that the inherent assumptions of the model (i.e., what structure the model expects the input data to have) are crafted to match the invariances and spatial dependencies of the data. For example, a CNN has translational invariance: it will detect an object wherever the object is placed (there are other factors, but for now let's only consider objects smaller than the CNN's window).
Benchmark
the code provides 45 datasets split across different settings
medium/large
with/without categorical features
accounts for hyperparameter tuning cost.
But how does it account for it?
Hyperparameter tuning leads to uncontrolled variance on a benchmark [Bouthillier et al., 2021], especially with a small budget of model evaluations.
Here, different hyperparameter configurations give different scores. A model might achieve its best score on the 3rd trial, or it might not reach its best score until the 300th trial. This depends on the configuration, so there is variance in the result, and it is uncontrollable because we do not know in advance which configuration yields the best result.
design a benchmarking procedure that jointly samples the variance of hyperparameter tuning and explores increasingly high budgets of model evaluations.
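A sketch of that idea: run a large random search once per model, then repeatedly shuffle the order of the trials and record the best score seen so far at each budget. The trial scores below are synthetic placeholders, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder: validation scores of 300 random-search trials for one model on one dataset.
# In the real benchmark these come from actually fitting the model with random hyperparameters.
trial_scores = rng.uniform(0.60, 0.90, size=300)

budgets = [1, 5, 10, 50, 100, 300]
n_shuffles = 15

# The "best score found so far" depends on which trials happen to come first, so we
# shuffle the trial order many times and report the mean and spread at each budget.
for b in budgets:
    best_at_budget = [rng.permutation(trial_scores)[:b].max() for _ in range(n_shuffles)]
    print(f"budget {b:3d}: {np.mean(best_at_budget):.3f} +/- {np.std(best_at_budget):.3f}")
```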
choosing dataset.
what is “inter-sample” attention
preprocessing dataset?
Raw comparison of DL and tree based models.
Explanations of why Tree-based models work better than NNs.
BENCHMARKING
Questions that arose while reading the paper
What are the characteristics that led the authors to select a particular dataset?
The characteristics are as follows:
Heterogeneous columns: columns should correspond to features of different natures, not signal or image data.
Not high-dimensional: only datasets with a d/n ratio below 1/10. Note: I am not sure what d/n means in this context (presumably the number of features d over the number of samples n).
Dataset cannot have too little information.
Dataset cannot be stream-like or time series.
Should be real-world data.
Datasets cannot have fewer than 4 features or fewer than 3000 samples.
Dataset cannot be too easy.
The authors use a different scoring system. But does it account for a different Bayes error rate for each dataset? That is the real question.
They remove a dataset if a default Logistic Regression (or Linear Regression for regression) reaches a score whose relative difference with the score of both a default ResNet (from Gorishniy et al. 2021) and a default HistGradientBoosting model (from scikit-learn) is below 5%.
Datasets should be non-deterministic. That means removing datasets where the target is a deterministic function of the data.
The benchmark is created to make the learning tasks as homogeneous as possible. Therefore, challenges of tabular data that require a separate analysis have been omitted. (Here, the question is: has this been omitted from the analysis, or has it been omitted from the benchmark?)
Only medium-sized training sets are used for the analysis.
Remove all the missing data.
Balanced classes.
categorical features with more than 20 items are removed.
Numerical features with fewer than 10 unique values are also removed.
Do tree-based models still remain SOTA for small- and large-sized data?
How do you account for
How can a model be robust to uninformative features?
Can NN models preserve the orientation of the data?
This article introduces the TabR model, a retrieval-augmented model designed for tabular data. It is part of a series on tabular deep learning using the Mambular library, which started with an introduction to using an MLP for these tasks.
Architecture Overview
TabR is a retrieval-augmented tabular deep learning method that leverages context from the rest of the dataset/database to enrich the representation of the target object, producing more accurate and up-to-date predictions. It uses related data points to enhance the prediction. The TabR model consists of three main components: the encoder module, the retrieval module, and the predictor module. The architecture of the TabR model is illustrated in the figure below:
The model is a feed-forward network with a retrieval component located in the residual branch. First, a target object and its candidates for retrieval are encoded with the same encoder E. Then, the retrieval module R enriches the target object’s representation by retrieving and processing relevant objects from the candidates. Finally, predictor P makes a prediction. The bold path highlights the structure of the feed-forward retrieval-free model before the addition of the retrieval module R.
Model Fitting
Now that we have outlined the TabR model, let’s move on to model fitting. The dataset and packages are publicly available, so everything can be copied and run locally or in a Google Colab notebook, provided the necessary packages are installed. We will start by installing the mambular package, loading the dataset, and fitting TabR. Subsequently, we will compare these results with those obtained in earlier articles of this series.
Install Mambular
pip install mambular
pip install delu
pip install faiss-cpu # faiss-gpu for gpu
Prepare the Data
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
# Load California Housing dataset
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
# Drop NAs
X = X.dropna()
y = y[X.index]
# Standard normalize the target (mambular preprocesses the features internally)
y = StandardScaler().fit_transform(y.values.reshape(-1, 1)).ravel()
# Train-test-validation split
X_train, X_temp, y_train, y_temp = train_test_split(
X, y, test_size=0.5, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
X_temp, y_temp, test_size=0.5, random_state=42
)
Train TabR with Mambular
from mambular.models import TabRRegressor
model = TabRRegressor()
model.fit(X_train, y_train, max_epochs=200)
preds = model.predict(X_test)
model.evaluate(X_test, y_test)
Mean Squared Error on Test Set: 0.1877
Compared to MLP-PLR and MLP-PLE, this is comparable performance. However, compared to XGBoost, it is not a good fit. Let's try TabR with PLE as the numerical pre-processing, as already used in the FT Transformer article.
Again, compared to XGBoost, this approach does not seem to be a good fit. Therefore, let's try an alternative numerical preprocessing method: MinMax scaling.
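The snippets for these two runs look roughly like the one below. Note that the preprocessing argument name is my assumption based on how the earlier articles in this series configure mambular models, so check the mambular documentation if it does not match:

```python
from mambular.models import TabRRegressor

# TabR with piecewise linear encoding (PLE) for the numerical features.
# The `numerical_preprocessing` argument name is assumed; see the mambular docs.
model_ple = TabRRegressor(numerical_preprocessing="ple")
model_ple.fit(X_train, y_train, max_epochs=200)
model_ple.evaluate(X_test, y_test)

# TabR with MinMax scaling as the numerical preprocessing
# (exact option string may differ in your mambular version).
model_minmax = TabRRegressor(numerical_preprocessing="minmax")
model_minmax.fit(X_train, y_train, max_epochs=200)
model_minmax.evaluate(X_test, y_test)
```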
The Mean Squared Error (MSE) on the test set is 0.1573, making this our best-performing approach to date, outperforming the other deep learning models as well as tree-based methods like XGBoost.
Below we have summarized the results from all articles so far. Try playing around with some more parameters and improve performance even more. Throughout this series, we will add the results of each introduced method to this table: