The Big Challenge: Fashion Visual Search with Real World Images

Discover Fashion Using Visual User-Generated Content (UGC)

Visual search lets customers discover fashion instead of flipping through endless pages of text-based search results. AI technology gives brands the tools to offer their customers instant, easy access to what they want: the trends they see on social media, the photos they take of people on the street, a screenshot from Instagram, or a picture in an advert, letting them find what they love and buy it right away. This is making AI a mainstream solution and a necessary investment for fashion retailers.

Putting this theory into practice raises some challenges. The images users submit for a visual search are far less polished than the typical images an e-commerce site has for its products.

At the ReWork Deep Learning London Summit, Arnau Ramisa, Senior Computer Vision Researcher, explained how Wide Eyes Technologies has overcome this difficulty and made search with real-world images possible.

Image Search: How to connect “real-world images” with the product catalog?

BY ARNAU RAMISA, Senior Computer Vision Researcher at Wide Eyes Technologies

Wide Eyes Technologies works to help people find what they love, faster. That means that when someone sees, say, a pair of shoes they fall in love with, they must be able to find where to buy them. In a zap.

Image retrieval is an area of computer vision that has already delivered impressive results, especially with data like holiday pictures, monuments, or book covers. Granted, garments are a bit tougher: their deformable and stretchable nature adds a layer of complexity to the retrieval process, and generic models like the ones trained on ImageNet don’t do too well. Nevertheless, if we take a lot of product images and train a deep network to recognize them, it will generate image embeddings of decent quality that can be used for image search.
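To make the idea concrete, here is a minimal sketch of how a product image could be turned into an embedding for similarity search with an off-the-shelf CNN. The backbone, preprocessing, and embedding size are illustrative assumptions, not Wide Eyes’ production pipeline.

import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Stand-in backbone; in practice the network would be trained or
# fine-tuned on the retailer's product images first.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 2048-d embedding
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Return an L2-normalized embedding for one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        vec = backbone(img)
    return F.normalize(vec, dim=1)

# Search then reduces to nearest neighbours in embedding space, e.g.:
# scores = catalog_embeddings @ embed("query.jpg").T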

Yet browsing a retailer’s catalog by image similarity is one thing; trying to find the product you fell in love with while walking down the street, for which you only have a snap from your phone or a photo from social media, is quite another.

If the photo quality is good, you may be lucky and find reasonable results. But while product photos are always shot in professional settings, with good illumination and often against a white background, “real world” photos rarely are: they come with many added difficulties, like poor resolution, motion blur, bad camera quality, unusual viewpoints, rotations, multiple products in the same image, bad lighting, partial views and occlusions, or even Instagram filters.

In this scenario it’s very hard for a model trained only on “ideal” catalog pictures to correctly associate a product image with a picture of the same, or even a similar, product taken in the street by a non-professional photographer using a cellphone.

The question becomes, then, how to bridge the divide between these two domains and bring “real-world” pictures (also called visual user-generated content, or UGC) closer to their catalog equivalents. Fortunately, we can use metric learning to restructure our embedding space (our image representations) so that corresponding catalog and street photos end up close together. This can be achieved with a particular type of architecture called “Siamese networks”, first proposed for biometric verification by Bromley et al. and by Baldi and Chauvin back in 1993.

The Siamese network architecture has two branches that share the same parameters: in our case, one for catalog pictures and another for real-world photos taken by customers. To train it, pairs of catalog and real-world pictures are processed by the network, and the distance between the resulting representations is used to update the parameters. We want the “real world” and catalog pictures of the same product to end up as close as possible.
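A minimal sketch of this setup could look like the following, assuming a PyTorch-style embedding network. The backbone choice and embedding size are arbitrary assumptions; the key point is that both branches run through exactly the same weights.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SiameseEmbedder(nn.Module):
    """One embedding network applied to both inputs, so the two
    branches share every parameter."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any CNN backbone would do
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
        self.backbone = backbone

    def embed(self, x):
        return F.normalize(self.backbone(x), dim=1)

    def forward(self, catalog_img, street_img):
        # The same weights process both the catalog picture and the user photo.
        return self.embed(catalog_img), self.embed(street_img)

model = SiameseEmbedder()
catalog_batch = torch.randn(8, 3, 224, 224)   # dummy catalog images
street_batch = torch.randn(8, 3, 224, 224)    # dummy "real world" photos
emb_c, emb_s = model(catalog_batch, street_batch)
distance = F.pairwise_distance(emb_c, emb_s)  # drives the parameter updates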

However, we cannot train the network using only pairs of corresponding images: it would simply learn to minimize the distance by collapsing all image representations to a single point! We also need to feed it “impostor pairs”, where the catalog and real-world images do not correspond to the same product, and perhaps not even to the same category. The loss function then gets a new term: a “safety margin” that we do not want impostor pairs to cross. Any pair of non-corresponding images that does will increase the loss. This way, the network learns from both the “true” and the “impostor” image pairs.
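One common way to express such a loss is the contrastive loss sketched below: true pairs are pulled together, while impostor pairs are pushed apart until they clear the margin. The margin value here is an illustrative assumption.

import torch
import torch.nn.functional as F

def contrastive_loss(emb_catalog, emb_street, is_match, margin=0.5):
    """is_match: 1.0 for true catalog/street pairs, 0.0 for impostor pairs."""
    d = F.pairwise_distance(emb_catalog, emb_street)
    positive_term = is_match * d.pow(2)                         # pull true pairs together
    negative_term = (1 - is_match) * F.relu(margin - d).pow(2)  # push impostors past the margin
    return 0.5 * (positive_term + negative_term).mean()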

Usually, as training progresses and the network learns, fewer and fewer negative pairs actually contribute to the loss, so including them in a minibatch wastes space and time and can even lead to worse performance. For the network to keep learning, it is therefore important to find hard negatives. These can be found by running the forward pass on a large batch, retaining only the hardest pairs, and then doing a full learning step with them.
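A rough sketch of that mining step, continuing with the hypothetical SiameseEmbedder and contrastive_loss above: run a large candidate batch of impostor pairs forward without gradients, keep only the pairs with the smallest distances, then do the actual learning step on those. The batch size and selection rule are assumptions for illustration.

import torch
import torch.nn.functional as F

def mine_hard_negatives(model, catalog_imgs, street_imgs, k=32):
    """Return the indices of the k impostor pairs with the smallest distance."""
    with torch.no_grad():                          # cheap forward pass, no gradients
        emb_c, emb_s = model(catalog_imgs, street_imgs)
        d = F.pairwise_distance(emb_c, emb_s)
    return torch.topk(-d, k).indices               # smallest distance = hardest negative

# The full learning step then runs only on the selected pairs, e.g.:
# idx = mine_hard_negatives(model, big_catalog_batch, big_street_batch)
# emb_c, emb_s = model(big_catalog_batch[idx], big_street_batch[idx])
# loss = contrastive_loss(emb_c, emb_s, is_match=torch.zeros(len(idx)))
# loss.backward(); optimizer.step()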

Thanks to Siamese networks, Wide Eyes Technologies can offer fashion retailers an amazing visual shopping experience for their customers: discover fashion with a single snap. After all, capturing images with smartphones has become a habit, and in fashion, inspiration always starts with an image.
