Deep Learning methods for Fashion Multimedia Search and Retrieval

Godi, Marco

Online fashion shopping is a growing market, and with this growth comes a greater need for techniques that automate an ever-expanding variety of tasks more accurately. Deep Learning techniques have been successfully applied to many tasks in the fashion domain, such as classification (recognizing different categories of clothes), recommendation (learning a user's preferences to make suggestions), and generation (automatically generating or editing clothes). In this thesis we focus on search and retrieval problems in this domain. Tools of this kind can speed up many tasks on both the user side and the industry side.

In the first part, we analyze existing models for fashion feature extraction and show their shortcomings. The analysis relies on visual summaries, a compact representation of a set of saliency maps that describes the elements that contributed to a classification. We show that these models almost ignore texture information, even when it should be significant for a particular style.

This motivates the second part, where we design a new kind of texture descriptor that builds upon texels, the repeated mid-level elements of a texture. Simple statistics on texels yield interpretable attributes that can be used to improve feature representations for tasks such as image retrieval and interactive search. An attribute-based texture descriptor can be plugged into a pre-existing image search framework and easily used by customers who wish to browse a textile catalog, or by designers who wish to choose fabrics for clothing production. Navigation in such a catalog leverages attributes through relative comparisons for fast exploration of the texture space. We show the advantages of working with texels and how they can be detected using a Mask-RCNN architecture trained on the ElBa dataset, which we introduce in this thesis. ElBa is composed of synthetic images of element-based textures that explore a wide variety of colors, spatial patterns and shapes.

In the third part, we present a framework for Street-To-Shop matching. This is an image retrieval problem where the query image is a picture containing a clothing item and the gallery set is composed of pictures of the clothes sold in an e-shop. The goal is to find the product in the shop most similar to the one in the picture. Compared to existing approaches, we focus on the less explored Video-To-Shop problem, extending the task to the time dimension: information is extracted from a video sequence to further improve search results, thanks to an attention mechanism that focuses on the most salient frames. We also design a training procedure that does not require bounding box annotations yet still outperforms existing approaches that do require them. The model is trained on the MovingFashion dataset, which we also present in this thesis. This gives users a new way to browse an online shop, for example by taking pictures of clothes that somebody is wearing or that are seen in a physical shop, and searching for them online automatically. It also has many implications for social media marketing and market research for fashion companies.
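
As a rough illustration of the Video-To-Shop idea described above, the following sketch shows one common way to pool per-frame embeddings with attention weights and then rank shop items by similarity. It is not the architecture of the thesis, which the abstract does not specify: the function names, the dot-product scoring, the `scoring_vector` (which in a real model would be learned) and the toy data are all assumptions made for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_embeddings, scoring_vector):
    """Aggregate per-frame embeddings into a single video embedding.

    Each frame gets a saliency score (here a plain dot product with a
    hypothetical scoring vector); the softmax of the scores gives the
    attention weights used for the weighted average.
    """
    scores = [dot(f, scoring_vector) for f in frame_embeddings]
    weights = softmax(scores)
    dim = len(frame_embeddings[0])
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frame_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank_gallery(query, gallery):
    """Return shop item names sorted from most to least similar."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]),
                  reverse=True)


# Toy data (hypothetical): two clear frames and one occluded frame.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scoring_vector = [2.0, 0.0]  # assumption: would be learned in practice
gallery = {"red_dress": [1.0, 0.0], "blue_jeans": [0.0, 1.0]}

video_embedding, weights = attention_pool(frames, scoring_vector)
ranking = rank_gallery(video_embedding, gallery)
```

In this toy setup the occluded third frame receives the smallest attention weight, so it contributes little to the pooled embedding used to query the gallery.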
