There has been a surge in the number of Machine Learning methods to analyze products kept on retail shelves images. Deep learning based computer vision methods can be used to detect products on retail shelves and then classify them. However, there are different sized variants of products which look exactly the same visually and the method to differentiate them is to look at their relative sizes with other products on shelves. This makes the process of deciphering the sized based variants from each other using computer vision algorithms alone impractical. In this work, we propose methods to ascertain the size variant of the product as a downstream task to an object detector which extracts products from shelf and a classifier which determines product brand. Product variant determination is the task which assigns a product variant to products of a brand based on the size of bounding boxes and brands predicted by classifier. While gradient boosting based methods work well for products whose facings are clear and distinct, a noise accommodating Neural Network method is proposed for cases where the products are stacked irregularly.
Point of Sale Materials(POSM) are the merchandising and decoration items that are used by companies to communicate product information and offers in retail stores. POSMs are part of companies' retail marketing strategy and are often applied as stylized window displays around retail shelves. In this work, we apply computer vision techniques to the task of verification of POSMs in supermarkets by telling if all desired components of window display are present in a shelf image. We use Convolutional Neural Network based unsupervised keypoint matching as a baseline to verify POSM components and propose a supervised Neural Network based method to enhance the accuracy of baseline by a large margin. We also show that the supervised pipeline is not restricted to the POSM material it is trained on and can generalize. We train and evaluate our model on a private dataset composed of retail shelf images.
The problem statement addressed in this work is : For a public sentiment classification API, how can we set up a classifier that works well on different types of data, having limited ability to annotate data from across domains. We show that given a large amount of unannotated data from across different domains and pseudolabels on this dataset generated by a classifier trained on a small annotated dataset from one domain, we can train a sentiment classifier that generalizes better across different datasets.
When performing Polarity Detection for different words in a sentence, we need to look at the words around to understand the sentiment. Massively pretrained language models like BERT can encode not only just the words in a document but also the context around the words along with them. This begs the questions, "Does a pretrain language model also automatically encode sentiment information about each word?" and "Can it be used to infer polarity towards different aspects?". In this work we try to answer this question by showing that training a comparison of a contextual embedding from BERT and a generic word embedding can be used to infer sentiment. We also show that if we finetune a subset of weights the model built on comparison of BERT and generic word embedding, it can get state of the art results for Polarity Detection in Aspect Based Sentiment Classification datasets.
Retail product Image classification problems are often few shot classification problems, given retail product classes cannot have the type of variations across images like a cat or dog or tree could have. Previous works have shown different methods to finetune Convolutional Neural Networks to achieve better classification accuracy on such datasets. In this work, we try to address the problem statement : Can we pretrain a Convolutional Neural Network backbone which yields good enough representations for retail product images, so that training a simple logistic regression on these representations gives us good classifiers ? We use contrastive learning and pseudolabel based noisy student training to learn representations that get accuracy in order of finetuning the entire Convnet backbone for retail product image classification.
Retail scenes usually contain densely packed high number of objects in each image. Standard object detection techniques use fully supervised training methodology. This is highly costly as annotating a large dense retail object detection dataset involves an order of magnitude more effort compared to standard datasets. Hence, we propose semi-supervised learning to effectively use the large amount of unlabeled data available in the retail domain. We adapt a popular self supervised method called noisy student initially proposed for object classification to the task of dense object detection. We show that using unlabeled data with the noisy student training methodology, we can improve the state of the art on precise detection of objects in densely packed retail scenes. We also show that performance of the model increases as you increase the amount of unlabeled data.
Object detection is a popular branch of research in computer vision, many state of the art object detection algorithms have been introduced in the recent past, but how good are those object detectors when it comes to dense object detection? In this paper we review common and highly accurate object detection methods on the scenes where numerous similar looking objects are placed in close proximity with each other. We also show that, multi-task learning of gaussian maps along with classification and bounding box regression gives us a significant boost in accuracy over the baseline. We introduce Gaussian Layer and Gaussian Decoder in the existing RetinaNet network for better accuracy in dense scenes, with the same computational cost as the RetinaNet. We show the gain of 6% and 5% in mAP with respect to baseline RetinaNet. Our method also achieves the state of the art accuracy on the SKU110K dataset.
The recent surge of automation in the retail industries has rapidly increased demand for applying deep learning models on mobile devices. To make the deep learning models real-time on-device, a compact efficient network becomes inevitable. In this paper, we work on one such common problem in the retail industries - Shelf segmentation. Shelf segmentation can be interpreted as a pixel-wise classification problem, i.e., each pixel is classified as to whether they belong to visible shelf edges or not. The aim is not just to segment shelf edges, but also to deploy the model on mobile devices. As there is no standard solution for such dense classification problem on mobile devices, we look at semantic segmentation architectures which can be deployed on edge. We modify low-footprint semantic segmentation architectures to perform shelf segmentation. In addressing this issue, we modified the famous U-net architecture in certain aspects to make it fit for on-devices without impacting significant drop in accuracy and also with 15X fewer parameters. In this paper, we proposed Light Weight Segmentation Network (LWSNet), a small compact model able to run fast on devices with limited memory and can train with less amount (~ 100 images) of labeled data.
Retail Product Image Classification is an important Computer Vision and Machine Learning problem for building real world systems like self-checkout stores and automated retail execution evaluation. In this work, we present various tricks to increase accuracy of Deep Learning models on different types of retail product image classification datasets. These tricks enable us to increase the accuracy of fine tuned convnets for retail product image classification by a large margin. As the most prominent trick, we introduce a new neural network layer called Local-Concepts-Accumulation (LCA) layer which gives consistent gains across multiple datasets. Two other tricks we find to increase accuracy on retail product identification are using an instagram-pretrained Convnet and using Maximum Entropy as an auxiliary loss for classification.
Object detection in densely packed scenes is a new area where standard object detectors fail to train well. Dense object detectors like RetinaNet trained on large and dense datasets show great performance. We train a standard object detector on a small, normally packed dataset with data augmentation techniques. This dataset is 265 times smaller than the standard dataset, in terms of number of annotations. This low data baseline achieves satisfactory results (mAP=0.56) at standard IoU of 0.5. We also create a varied benchmark for generic SKU product detection by providing full annotations for multiple public datasets. It can be accessed at this https URL. We hope that this benchmark helps in building robust detectors that perform reliably across different settings in the wild.
This paper describes our proposed system & experiments performed to detect contextual emotion in texts for SemEval 2019 Task 3. We exploit sentiment information, syntactic patterns & semantic relatedness to capture diverse aspects of the text. Word level embeddings such as Glove, FastText, Emoji along with sentence level embeddings like Skip-Thought, DeepMoji & Unsupervised Sentiment Neuron were used as input features to our architecture. We democratize the learning using ensembling of models with different parameters to produce the final output. This paper discusses comparative analysis of the significance of these embeddings and our approach for the task.
Continuous Speech Keyword Spotting (CSKS) is the problem of spotting keywords in recorded conversations, when a small number of instances of keywords are available in training data. Unlike the more common Keyword Spotting, where an algorithm needs to detect lone keywords or short phrases like "Alexa", "Cortana", "Hi Alexa!", "Whatsup Octavia?" etc. in speech, CSKS needs to filter out embedded words from a continuous flow of speech, ie. spot "Anna" and "github" in "I know a developer named Anna who can look into this github issue." Apart from the issue of limited training data availability, CSKS is an extremely imbalanced classification problem. We address the limitations of simple keyword spotting baselines for both aforementioned challenges by using a novel combination of loss functions (Prototypical networks' loss and metric loss) and transfer learning. Our method improves F1 score by over 10%.
We try to address the problem of document layout understanding using a simple algorithm which generalizes across multiple domains while training on just few examples per domain. We approach this problem via supervised object detection method and propose a methodology to overcome the requirement of large datasets. We use the concept of transfer learning by pre-training our object detector on a simple artificial (source) dataset and fine-tuning it on a tiny domain specific (target) dataset. We show that this methodology works for multiple domains with training samples as less as 10 documents. We demonstrate the effect of each component of the methodology in the end result and show the superiority of this methodology over simple object detectors.
Incremental Learning is well known machine learning approach wherein the weights of the learned model are dynamically and gradually updated to generalize on new unseen data without forgetting the existing knowledge. Incremental learning proves to be time as well as resource-efficient solution for deployment of deep learning algorithms in real world as the model can automatically and dynamically adapt to new data as and when annotated data becomes available. The development and deployment of Computer Aided Diagnosis (CAD) tools in medical domain is another scenario, where incremental learning becomes very crucial as collection and annotation of a comprehensive dataset spanning over multiple pathologies and imaging machines might take years. However, not much has so far been explored in this direction. In the current work, we propose a robust and efficient method for incremental learning in medical imaging domain. Our approach makes use of Hard Example Mining technique (which is commonly used as a solution to heavy class imbalance) to automatically select a subset of dataset to fine-tune the existing network weights such that it adapts to new data while retaining existing knowledge. We develop our approach for incremental learning of our already under test model for detecting dental caries. Further, we apply our approach to one publicly available dataset and demonstrate that our approach reaches the accuracy of training on entire dataset at once, while availing the benefits of incremental learning scenario.
In this paper, we describe the system submitted for the SemEval 2018 Task 3 (Irony detection in English tweets) Subtask A by the team Binarizer. Irony detection is a key task for many natural language processing works. Our method treats ironical tweets to consist of smaller parts containing different emotions. We break down tweets into separate phrases using a dependency parser. We then embed those phrases using an LSTM-based neural network model which is pre-trained to predict emoticons for tweets. Finally, we train a fully-connected network to achieve classification.
We propose a simple Neural Network model which can learn relation between sentences by modeling the task as Earth Mover's Distance(EMD) calculation. Underlying hypothesis is that a neural module can learn to approximate the flow optimization in EMD calculation for sentence comparison. Our model is simple to implement, light in terms of parameters and works across multiple supervised sentence comparison tasks. We show good results for the model on two datasets. Our model combines LSTM with a relational unit to model sentence comparison.
We propose a weakly supervised method using two algorithms to predict object bounding boxes given only an image classification dataset. First algorithm is a simple Fully Convolutional Network (FCN) trained to classify object instances. We use the property of FCN to return a mask for images larger than training images to get a primary output segmentation mask during test time by passing an image pyramid to it. We enhance the FCN output mask into final output bounding boxes by a Convolutional Encoder-Decoder (ConvAE) viz. the second algorithm. ConvAE is trained to localize objects on an artificially generated dataset of output segmentation masks. We demonstrate the effectiveness of this method in localizing objects in grocery shelves where annotating data for object detection is hard due to variety of objects. This method can be extended to any problem domain where collecting images of objects is easy and annotating their coordinates is hard.
Tuberculosis(TB) in India is the world's largest TB epidemic. TB leads to 480,000 deaths every year. Between the years 2006 and 2014, Indian economy lost US$340 Billion due to TB. This combined with the emergence of drug resistant bacteria in India makes the problem worse. The government of India has hence come up with a new strategy which requires a high-sensitivity microscopy based TB diagnosis mechanism. We propose a new Deep Neural Network based drug sensitive TB detection methodology with recall and precision of 83.78% and 67.55% respectively for bacillus detection. This method takes a microscopy image with proper zoom level as input and returns location of suspected TB germs as output. The high accuracy of our method gives it the potential to evolve into a high sensitivity system to diagnose TB when trained at scale.
Zero-shot Learners are models capable of predicting unseen classes. In this work, we propose a Zero-shot Learning approach for text categorization. Our method involves training model on a large corpus of sentences to learn the relationship between a sentence and embedding of sentence's tags. Learning such relationship makes the model generalize to unseen sentences, tags, and even new datasets provided they can be put into same embedding space. The model learns to predict whether a given sentence is related to a tag or not; unlike other classifiers that learn to classify the sentence as one of the possible classes. We propose three different neural networks for the task and report their accuracy on the test set of the dataset used for training them as well as two other standard datasets for which no retraining was done. We show that our models generalize well across new unseen classes in both cases. Although the models do not achieve the accuracy level of the state of the art supervised models, yet it evidently is a step forward towards general intelligence in natural language processing.
We train a deep Convolutional Neural Network (CNN) from scratch for visual aesthetic analysis in images and discuss techniques we adopt to improve the accuracy. We avoid the prevalent best transfer learning approaches of using pretrained weights to perform the task and train a model from scratch to get accuracy of 78.7% on AVA2 Dataset close to the best models available (85.6%). We further show that accuracy increases to 81.48% on increasing the training set by incremental 10 percentile of entire AVA dataset showing our algorithm gets better with more data.
Chest X-ray is one of the most accessible medical imaging technique for diagnosis of multiple diseases. With the availability of ChestX-ray14, which is a massive dataset of chest X-ray images and provides annotations for 14 thoracic diseases; it is possible to train Deep Convolutional Neural Networks (DCNN) to build Computer Aided Diagnosis (CAD) systems. In this work, we experiment a set of deep learning models and present a cascaded deep neural network that can diagnose all 14 pathologies better than the baseline and is competitive with other published methods. Our work provides the quantitative results to answer following research questions for the dataset: 1) What loss functions to use for training DCNN from scratch on ChestX-ray14 dataset that demonstrates high class imbalance and label co occurrence? 2) How to use cascading to model label dependency and to improve accuracy of the deep learning model?
We develop a Computer Aided Diagnosis (CAD) system, which enhances the performance of dentists in detecting wide range of dental caries. The CAD System achieves this by acting as a second opinion for the dentists with way higher sensitivity on the task of detecting cavities than the dentists themselves. We develop annotated dataset of more than 3000 bitewing radiographs and utilize it for developing a system for automated diagnosis of dental caries. Our system consists of a deep fully convolutional neural network (FCNN) consisting 100+ layers, which is trained to mark caries on bitewing radiographs. We have compared the performance of our proposed system with three certified dentists for marking dental caries. We exceed the average performance of the dentists in both recall (sensitivity) and F1-Score (agreement with truth) by a very large margin. Working example of our system is shown in Figure 1.
This work is an endeavor to develop a deep learning methodology for automated anatomical labeling of a given region of interest (ROI) in brain computed tomography (CT) scans. We combine both local and global context to obtain a representation of the ROI. We then use Relation Networks (RNs) to predict the corresponding anatomy of the ROI based on its relationship score for each class. Further, we propose a novel strategy employing nearest neighbors approach for training RNs. We train RNs to learn the relationship of the target ROI with the joint representation of its nearest neighbors in each class instead of all data-points in each class. The proposed strategy leads to better training of RNs along with increased performance as compared to training baseline RN network.
We describe a deep learning approach for automated brain hemorrhage detection from computed tomography (CT) scans. Our model emulates the procedure followed by radiologists to analyse a 3D CT scan in real-world. Similar to radiologists, the model sifts through 2D cross-sectional slices while paying close attention to potential hemorrhagic regions. Further, the model utilizes 3D context from neighboring slices to improve predictions at each slice and subsequently, aggregates the slice-level predictions to provide diagnosis at CT level. We refer to our proposed approach as Recurrent Attention DenseNet (RADnet) as it employs original DenseNet architecture along with adding the components of attention for slice level predictions and recurrent neural network layer for incorporating 3D context. The real-world performance of RADnet has been benchmarked against independent analysis performed by three senior radiologists for 77 brain CTs. RADnet demonstrates 81.82% hemorrhage prediction accuracy at CT level that is comparable to radiologists. Further, RADnet achieves higher recall than two of the three radiologists, which is remarkable.
Semantic Similarity between two sentences can be defined as a way to determine how related or unrelated two sentences are. The task of Semantic Similarity in terms of distributed representations can be thought to be generating sentence embeddings (dense vectors) which take both context and meaning of sentence in account. Such embeddings can be produced by multiple methods, in this paper we try to evaluate LSTM auto encoders for generating these embeddings. Unsupervised algorithms (auto encoders to be specific) just try to recreate their inputs, but they can be forced to learn order (and some inherent meaning to some extent) by creating proper bottlenecks. We try to evaluate how properly can algorithms trained just on plain English Sentences learn to figure out Semantic Similarity, without giving them any sense of what meaning of a sentence is.
With the recent advancements in information technology there has been a huge surge in amount of data available. But information retrieval technology has not been able to keep up with this pace of information generation resulting in over spending of time for retrieving relevant information. Even though systems exist for assisting users to search a database along with filtering and recommending relevant information, but recommendation system which uses content of documents for recommendation still have a long way to mature. Here we present a Deep Learning based supervised approach to recommend similar documents based on the similarity of content. We combine the C-DSSM model with Word2Vec distributed representations of words to create a novel model to classify a document pair as relevant/irrelavant by assigning a score to it. Using our model retrieval of documents can be done in O(1) time and the memory complexity is O(n), where n is number of documents.