The Priberam Machine Learning Lunch Seminars are a series of informal meetings held every two weeks at Instituto Superior Técnico, in Lisbon. They work as a discussion forum involving different research groups, from IST and elsewhere. Participants are interested in areas such as (but not limited to) statistical machine learning, signal processing, pattern recognition, computer vision, natural language processing, computational biology, neural networks, control systems, reinforcement learning, and anything related (even if vaguely) to machine learning.

The seminars last about one hour (including time for discussion and questions) and revolve around the general topic of Machine Learning. The speaker is a volunteer who decides the topic of his/her presentation. Past seminars have included presentations of state-of-the-art research, surveys and tutorials, practice runs of conference talks, challenging problems presented with a request for help, and interesting applications of Machine Learning such as prototypes or finished products.

Presenters can have any background: undergrads, graduate students, academic researchers, company staff, etc. Anyone is welcome both to attend a seminar and to present one. Occasionally we will have invited speakers. See below for a list of all seminars, including the speakers, titles and abstracts.

Note: The seminars are held at lunch-time, and include delicious free food.

Feel free to join our mailing list, where seminar topics are announced beforehand. You may also visit the group webpage. Anyone can attend the seminars. If you would like to present something, please send us an email.

The seminars are usually held every other Tuesday, from 1 PM to 2 PM, at the IST campus in Alameda. This sometimes changes due to availability of the speakers, so check regularly!

Tuesday, April 14th 2020, 13h00 - 14h00

Luís Borges (CMU/IST)

Evaluating Neural Methods for Approximate String Matching and Duplicate Detection

Location (webinar): Zoom


Duplicate detection is concerned with identifying pairs of attributes/records that refer to the same real-world object, and it is a fundamental process for ensuring data quality in databases. Existing methods to detect duplicate attributes can leverage heuristic string similarity measures based on characters or small character sequences, phonetic encoding techniques that match strings based on the way they sound, or hybrid techniques that combine different approaches.

However, these methods rely on common sub-strings in order to establish similarity, and they often do not effectively capture the character replacements involved in duplicate attributes due to transliterations or the use of different languages and/or alphabets.
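As a rough illustration of the limitation described above, the following minimal sketch implements two standard character-based measures: Jaccard similarity over character bigrams and Levenshtein edit distance. It is not taken from the work being presented; the function names, the `#` padding symbol, and the example strings are assumptions chosen only for illustration.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string.

    The string is padded with '#' (an arbitrary choice for this sketch)
    so that the first and last characters also form n-grams.
    """
    padded = f"#{text.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard similarity between the character n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 1.0
    return len(grams_a & grams_b) / len(union)


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Spelling variants in the same alphabet share most bigrams.
    print(jaccard_similarity("Lisbon", "Lisboa"), levenshtein("Lisbon", "Lisboa"))
    # A transliterated pair shares no characters at all.
    print(jaccard_similarity("Moscow", "Москва"), levenshtein("Moscow", "Москва"))
```

In this sketch, the same-alphabet pair scores reasonably well, while the Latin/Cyrillic pair gets a bigram similarity of zero and an edit distance equal to the full string length, even though both strings name the same city. This is the kind of case where measures that depend on shared sub-strings give no useful signal.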

This work builds on recent work on string matching with deep neural networks, tackling the aforementioned challenges by leveraging recurrent neural units for modeling sequences of characters in order to build semantic representations for the input strings. We consider several alternative neural architectures, e.g. combining recurrent units with attention or pooling operations, or instead based on the Transformer model, and assess the impact of training data size and/or domain, specifically considering datasets describing collections of person names, organizations, or geographical locations. The results show that the neural models outperform standard string similarity measures on all datasets, even with relatively small amounts of training data and without the need for major tuning of the network parameters. Models trained on a specific domain were nonetheless shown to have problems generalizing to other domains (e.g., models trained on a dataset composed of person names perform worse when evaluated on pairs of organization names), although some level of knowledge transfer across domains was still observed (e.g., neural cross-domain models were still able to outperform standard string similarity metrics).


Bio: Luís Borges is a CMU Portugal student in Language Technologies, having started his PhD in 2019. His Master’s degree focused on applying Natural Language Processing and Deep Learning techniques to fake news detection. For his PhD, he is working on Neural Information Retrieval, also leveraging NLP and Deep Learning tools.
