Implementing Search for Swahili Text
We'll explore a number of techniques for implementing search, from simple keyword filtering to more sophisticated semantic search. We'll use data from the MasakhaNER project for demonstration.
First, set up the dependencies.
!pip install -q requests pandas scikit-learn jupyter transformers tqdm
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
import pandas as pd
import requests
import io
import numpy as np
Get the data
The data we will use is Swahili news text from the masakhane-ner repository, originally sourced from the Swahili version of Voice of America.
docs_url = 'https://raw.githubusercontent.com/masakhane-io/masakhane-ner/main/text_by_language/swahili/voa_clean_final.txt'
docs_response = requests.get(docs_url)
documents_raw = docs_response.text
documents_raw[:999]
'Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa, watu takriban 14 zaidi wamepata maambukizi ya Covid-19.\nWalioambukizwa wote ni raia wa Tanzania, 13 wakiwa Dar-es-salaam na mmoja mjini Arusha.\nWizara ya afya imeripoti kwamba juhudi za kufuatilia watu waliokuwa karibu na wagonjwa zinaendelea.\nWakati wa maadimisho ya pasaka, wakristo walikusanyika kanisani kwa maombi bila kuzingatia ushauri wa wataalam wa afya.\nKuna mijadala kwenye mitandao ya kijamii Tanzania, kuhusu hatua zinazochukuliwa kudhibithi maambukizi nchini humo.\nNchini Afrika kusini watu 145 zaidi, wameambukizwa virusi vya Corona, na kujumulisha idadi ya watu 2,173 ambao wameambukizwa virusi vya Corona nchini humo.\nTaarifa ya wizara ya afya hata hivyo haijasema idadi ya watu ambao wamekufa wala kupona kutokana na virusi vya Corona nchini humo.\nNchini Sudan, maafisa wameongeza mikakati zaidi ya kuzuia virusi vya Corona kusambaa.\nWamepiga marufuku usafiri wa magari kati ya miji na kutekeleza sheria za hali ya dharura ili ku'
The data is composed of randomly ordered sentences from the news sources. We'll treat each sentence as a single document in our corpus, and extract the sentences line by line into a pandas dataframe.
pd.set_option('display.max_colwidth', 999) # avoid truncation of the column
in_memory_file = io.StringIO(documents_raw)
df = pd.DataFrame([l.strip() for l in in_memory_file], columns=['documents'])
df.tail()
 | documents |
---|---|
7667 | Wakati huohuo upande wa utetezi uliomwakilisha Jaji umesema hauna tena sababu ya kuwaita mashahidi wake, lakini itaomba mahakama hiyo kutupia mbali kesi hiyo kwa sababu serikali ya Nigeria imeshindwa kuthibitisha madai yake. |
7668 | Baadhi ya wananchi wa Nigeria wanadai kuwa hatua iliyochukuliwa na Rais Buhari, ni njama ya kumweka mwengine kutoka upande wa Kaskazini mwa Nigeria kuchukuwa cheo hicho cha Jaji Mkuu. |
7669 | Madai yao pia yanalenga suala la njama ya kurudi tena madarakani wakidai kuwa Buhari alikuwa akijitayarisha kushindania urais awamu ya pili na iwapo angeshindwa kesi ikifikishwa Mahakama Kuu kabisa, atakuwa na mtu wake wa karibu atakaye mwonea huruma na kuhakikisha kwamba anapata ushindi. |
7670 | Imetayarishwa na Mwandishi wetu, Collins Atohengbe, Nigeria |
7671 | Walinzi wa pwani ya Libya wamekamata wahamiaji 400 waliokuwa wakonjiani katika pwani ya Mediterranean ya nchi hiyo wakielekea Ulaya na kuwarejesha katika mji mkuu wa Tripoli masaa 24 yaliyopita, Shirika la uhamiaji la Umoja wa Mataifa UN limesema Jumapili. |
df.shape
(7672, 1)
We have 7672 documents in total.
The data includes text from 2020, when the COVID-19 pandemic was a major news item. Let's say that we want to find documents in our dataset related to Africa's response to the pandemic. We specify the query in Swahili:
query = "nchi za afrika zinajadiliana na shirika la afya kutafuta njia za kukabiliana na janga la corona"
Basic Text Search
Keyword Filtering
A simple technique is to use keywords from our query to filter for documents that may contain the information we need. We match only documents that contain all the keywords we've selected.
In the example below, we create filters for rows containing each word individually then combine the filters to leave only rows with all the words.
keywords = ['afrika', 'afya', 'corona']
filters = [df.documents.str.contains(word, case=False) for word in keywords]
combined_filter = np.vstack(filters).all(axis=0)
df[combined_filter]
 | documents |
---|---|
1706 | Waziri wa Afya Zweli Mkhize amethibitisha vifo hivyo, akisema marehemu wote hao walifia Magharibi mwa Cape province Afrika Kusini ina zaidi ya watu 1,000 walioambukizwa virusi hivyo, ikiwa idadi kubwa zaidi katika Afrika, ambapo zaidi ya watu 3,000 barani humo wamethibitishwa kuwa na ugonjwa wa virusi vya corona. |
Only one document matches all the selected keywords.
While this method is straightforward, it is sensitive to the choice of keywords and to misspellings, which makes it useful mostly for basic searches. When there is more than one match, the results have to be analysed further to determine the most relevant ones.
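One way to soften the all-or-nothing behaviour is to rank documents by how many of the keywords they contain. The sketch below is only an illustration: `sample_docs` and `keyword_hits` are hypothetical names, and the three sentences are made-up stand-ins rather than rows from the dataset.

```python
import re

# Hypothetical stand-in documents (not taken from the dataset).
sample_docs = [
    "Wizara ya afya imeripoti maambukizi ya corona barani Afrika",
    "Shirika la afya limetoa taarifa",
    "Uchaguzi umefanyika jumapili",
]
keywords = ["afrika", "afya", "corona"]

def keyword_hits(text, keywords):
    """Count how many keywords appear in the text as whole words (case-insensitive)."""
    words = set(re.findall(r"\w+", text.lower()))
    return sum(1 for kw in keywords if kw in words)

hits = [keyword_hits(doc, keywords) for doc in sample_docs]
# Sort document indices from most to fewest keyword hits.
ranked = sorted(range(len(sample_docs)), key=lambda i: hits[i], reverse=True)
for i in ranked:
    print(hits[i], sample_docs[i])
```

Unlike `str.contains`, which also matches substrings inside longer words, this sketch matches whole words only, so it may return fewer hits for the same keywords.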
Vectorization
With vectorization, we move from the text representation to a numerical one, which can help with ranking the results.
All the unique words in the document corpus are identified, ordered and assigned a unique index as their identity within the vocabulary. The collection of documents is then represented as a table where each row represents a single document and each column represents a word in the derived vocabulary. This is known as a document-term matrix, with documents as rows and terms as columns.
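To make the structure concrete, here is a minimal sketch that builds a document-term matrix by hand using only the standard library. The three-sentence corpus and the names `toy_docs`, `tokenize`, `vocabulary` and `matrix` are assumptions made for illustration; the token pattern is chosen to resemble the default used by scikit-learn's vectorizers (words of two or more characters).

```python
import re

# Hypothetical toy corpus (not from the dataset).
toy_docs = [
    "wizara ya afya ya tanzania",
    "shirika la afya",
    "janga la corona",
]

def tokenize(text):
    """Lowercase and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique term, sorted, mapped to a column index.
unique_terms = sorted({term for doc in toy_docs for term in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(unique_terms)}

# One row per document, one column per term, each cell holding a count.
matrix = [[0] * len(vocabulary) for _ in toy_docs]
for row, doc in enumerate(toy_docs):
    for term in tokenize(doc):
        matrix[row][vocabulary[term]] += 1

print(unique_terms)
for row in matrix:
    print(row)
```

Even in this tiny example most cells are zero, because each document uses only a few of the terms in the vocabulary.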
There are two types of vectorization we'll look at, Count Vectorization and TF-IDF.
Count Vectorization
In count vectorization, the value of each cell in the document-term matrix represents the number of times the particular word appears in the document. Calling fit_transform on the count vectorizer as shown below first creates the vocabulary by standardizing and ordering all the available terms, assigning a unique index to each (fit), then converts each document into a row of word counts (transform).
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(df.documents)
X.shape
(7672, 19056)
The resulting matrix has one row per document (7672) and one column per unique term extracted (19056).
We can examine the last few words of the vocabulary:
names = cv.get_feature_names_out()
len(names), names[-7:]
(19056, array(['zulu', 'zuma', 'zungumza', 'zuri', 'zusha', 'zweli', 'évariste'], dtype=object))
We can then create a new dataframe of the document-term matrix as below:
df_docs = pd.DataFrame(X.toarray(), columns=names)
df_docs.head()
 | 00 | 000 | 002 | 01 | 02 | 023 | 041 | 043 | 052 | 069 | ... | zubaidi | zubeir | zuio | zulu | zuma | zungumza | zuri | zusha | zweli | évariste |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 19056 columns
The document-term matrix is a sparse matrix: most words do not occur in most documents, so most of the counts are zero.
Looking at the first document, we can filter out the non-zero values to see the count of words that exist in it.
first_doc = df_docs.loc[0]
first_doc[first_doc > 0]
14            1
19            1
afya          1
covid         1
imeripoti     1
jumatatu      1
kuwa          1
maambukizi    1
takriban      1
tanzania      1
wamepata      1
watu          1
wizara        1
ya            3
zaidi         1
Name: 0, dtype: int64
df.loc[0]
documents    Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa, watu takriban 14 zaidi wamepata maambukizi ya Covid-19.
Name: 0, dtype: object
Query-Document similarity
To search the documents using the query, we first need to transform the query using the same vectorizer as the documents. This maps the query to the same vector space; the length of the resulting vector matches the vocabulary size and each index in it contains the count of the specific term in the query.
q = cv.transform([query])
df_query = pd.DataFrame(q.toarray(), columns=names)
df_query.shape
(1, 19056)
Most values are expected to be zeros as well.
encoded_query = df_query.loc[0]
encoded_query[encoded_query > 0]
afrika         1
afya           1
corona         1
janga          1
kukabiliana    1
kutafuta       1
la             2
na             2
nchi           1
njia           1
shirika        1
za             2
Name: 0, dtype: int64
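Notice that the query word zinajadiliana is missing from the output above, which suggests it is not in the fitted vocabulary: terms the vectorizer has not seen are simply dropped. The sketch below imitates that behaviour with the standard library; the five-term `vocabulary` and the helper `to_count_vector` are hypothetical and stand in for the fitted vectorizer.

```python
import re
from collections import Counter

# Hypothetical vocabulary learned from a toy corpus (term -> column index).
vocabulary = {"afya": 0, "corona": 1, "janga": 2, "la": 3, "shirika": 4}

def to_count_vector(text, vocabulary):
    """Map text onto the fixed vocabulary; terms outside it are dropped."""
    counts = Counter(re.findall(r"\b\w\w+\b", text.lower()))
    vector = [0] * len(vocabulary)
    for term, idx in vocabulary.items():
        vector[idx] = counts.get(term, 0)
    return vector

toy_query = "janga la corona na shirika la afya duniani"
query_vector = to_count_vector(toy_query, vocabulary)

# The vector length equals the vocabulary size, whatever the query length.
print(query_vector, len(toy_query.split()), sum(query_vector))
```

Here the toy query has eight words but only six of them are counted, because "na" and "duniani" are not in the assumed vocabulary.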
The more terms the query and a document share, and the higher the counts of those terms, the higher the similarity score. We use the dot product to score each document against the query, then rank by score. The dot product grows when the two vectors point in a similar direction, but also when either vector is simply longer.
For example, dot product between query and first document:
(first_doc * encoded_query).sum()
np.int64(1)
Across all the docs:
query_vector = q.toarray().flatten()
score = X.dot(query_vector)
score.shape
(7672,)
The score vector has the resulting dot products for each document. We identify the index of the highest one:
score.argmax(), score[2433]
(np.int64(2433), np.int64(30))
The best match is the document at index 2433, with a score of 30.
We identify the particular document from the dataframe.
df.documents[2433]
'” Swali la msingi la Mueller Swali la msingi la Mueller, mkurugenzi wa zamani wa FBI, ambalo anatafuta majibu : Ni iwapo Trump na wasaidizi wake walishirikiana na Warusi kuchafua kampeni ya mgombea wa chama cha Demokrat Hillary Clinton, mwaka 2016, kwa kutuma barua pepe zenye kudhalilisha zilizoibiwa kutoka Kamati ya Taifa ya chama cha Demokrat na mwenyekiti wa kampeni ya Clinton? Au iwapo Trump alikuwa amenufaika bila ya kukusudia na mbinu chafu za Russia? Na iwapo rais alijaribu kuharibu uchunguzi uliofuatia ili kujilinda yeye mwenyewe na washauri wa kisiasa na wasaidizi wake? Huu ndio ujumbe wa Idara ya Sheria kwa bunge la Congress juu ya hitimisho la uchunguzi uliofanywa na Mueller.'
We can show the top N results ranked from highest to lowest score:
top_idx = np.argsort(-score)[:10]
df.iloc[top_idx]
 | documents |
---|---|
2433 | ” Swali la msingi la Mueller Swali la msingi la Mueller, mkurugenzi wa zamani wa FBI, ambalo anatafuta majibu : Ni iwapo Trump na wasaidizi wake walishirikiana na Warusi kuchafua kampeni ya mgombea wa chama cha Demokrat Hillary Clinton, mwaka 2016, kwa kutuma barua pepe zenye kudhalilisha zilizoibiwa kutoka Kamati ya Taifa ya chama cha Demokrat na mwenyekiti wa kampeni ya Clinton? Au iwapo Trump alikuwa amenufaika bila ya kukusudia na mbinu chafu za Russia? Na iwapo rais alijaribu kuharibu uchunguzi uliofuatia ili kujilinda yeye mwenyewe na washauri wa kisiasa na wasaidizi wake? Huu ndio ujumbe wa Idara ya Sheria kwa bunge la Congress juu ya hitimisho la uchunguzi uliofanywa na Mueller. |
6107 | Lissu ameongeza kwamba “Kwa kutumia njia hizo za amri za kisiasa, vyombo vya ulinzi na usalama vikiwemo jeshi la wananchi la Tanzania, jeshi la polisi, taasisi ya kuzuia na kupambana na rushwa, na idara ya usalama wa taifa pamoja na mamlaka ya kodi TRA, vinatumika kukamata kwa nguvu na kutaifisha mali za wafanyabiashara wetu wa ndani na makampuni ya wawekezaji kutoka nje. |
968 | Akizungumza na shirika la habari la AFP Akram Taher mumini moja amesema “Sherehe za Eid hazifani manmo wakati huu wa hali ya janga la corona – watu wanahisia ya kua na hofu” Bara la Asia Waislam katika bara la Asia – kutoka Indonesia hadi Pakistan, Malaysia na Afghanistan – wamekusanyika katika masoko katika shamrashamra za manunuzi ya sikukuu, wakikiuka muongozo wa kudhibiti virusi vya corona na wakati mwengine polisi wakijaribu kutawanya mikusanyiko ya makundi makubwa. |
530 | Changamoto za uokozi Msemaji wa shirika la kimataifa la msalaba mwekundu Caroline Haga ameliambia shirika la habari la AFP kwamba wafanyakazi wa uokozi wanakabiliwa na changamoto kubwa kuwafikia wanaohitaji msaada na wana wasiwasi na huzuni, kwani siku ya Jumanne walifanikiwa kuwaokoa watu 167 na kwamba muda unakimbia haraka sana na watu bado wamekwama na wako hatarini. |
842 | “Wakati hatuamini kuwa katika hatua hii, hali hiyo inahitaji kupitishwa azimio, kuna dalili zote za kututahadharisha kuwa mgogoro wa kuminywa kwa haki za binadamu unafukuta,” imesema barua hiyo, ambayo imesainiwa na Mtandao wa kutetea haki za binadamu wa Bara la Afrika, Shirika la Amnesty International, Shirika la ARTICLE 19, Shirika la Asian Forum for Human Rights and Development, Kituo cha Centre for Civil Liberties – Ukraine, Human Rights Watch na Tume ya International Commission of Jurists na taasisi nyingine. |
5142 | Rais Trump Rais wa Marekani Donald Trump, anasema Biden, “amerusha baruti katika moto, na anajukumu la kutoa maelezo kwa watu wa Marekani juu ya mkakati na mpango wa kuhakikisha vikosi vyetu na wafanyakazi wa ubalozi, watu wetu na maslahi yetu, yote hapa nchini na nchi za nje, na washirika wetu katika eneo lote la Mashariki ya Kati na maeneo mengine. |
5682 | ” Ogwell amesema taasisi yake, ambayo ni shirika la ushauri wa kifundi la Umoja wa Afrika, anashirikiana na AU kuzijengea uwezo wa utayari nchi mbalimbali katika maeneo makuu matatu, ikiwemo kuboresha utoaji tahadhari katika bandari za nchi hizo na mahospitali; kuongeza utaalamu wa kuweza kupima kirusi COVID-19, ambao tayari nchi 43 wanauwezo huo; na kujenga uwezo wa kuzuia maambukizi na kudhibiti hali hiyo ili wagonjwa wenye maambukizi waweze kuwekewa karantini na kufuatiliwa. |
4934 | Morales alisema : "Kaka na dada zangu nchini Bolivia na ulimwenguni kote, nawafahamisha niko hapa na Makamu wa Rais na Waziri wa Afya, na baada ya kuwasilikiliza rafiki zangu kutoka shirikisho la vuguvugu la kijamii na shirikisho la umoja wa kibiashara na pia kwa kusikiliza Kanisa Katoliki, natangaza kujiuzulu wadhifa wangu wa urais. |
7355 | Kiongozi huyo wa cheo cha juu katika nchi za falme za kiarabu aliwaambia waandishi wa habari alisikiliza mtazamo wa Jenerali Abdel Fattah Burhan kuhusu matatizo ya Sudan na yeye alimweleza mtazamo wa umoja wa falme za kiarabu kuhusiana na hali hii ya kisiasa nchini Sudan Katibu mkuu wa umoja wa nchi za falme za kiarabu alifanya mazungumzo mjini Khartoum Jumapili na baraza la jeshi linalotawala Sudan. |
5437 | Pamoja na kuwa wanashirikiana katika mipaka yao na kuwepo kwao katika wigo la kiuchumi la pamoja, nchi za Afrika Mashariki zinakuwa na hisia kali baina yao zinazotokana na tofauti zao za kiuchumi na kisiasa. |
The results don't look too relevant, probably because they just happen to be long sentences that contain some words from the query several times. The dot product is sensitive to vector magnitudes, so longer sentences are likely to score higher simply because they contain more occurrences of the query's words.
Cosine similarity normalizes for magnitude of the vectors making the score less sensitive to the absolute counts of similar terms.
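A small worked example shows the difference. The vectors below are hypothetical count vectors over a four-term vocabulary, chosen so that a long document repeating two query terms beats an exact match on the dot product but not on cosine similarity; `cosine` is a helper written for this sketch.

```python
import numpy as np

# Hypothetical count vectors over a 4-term vocabulary.
query = np.array([1, 1, 1, 0])
short_doc = np.array([1, 1, 1, 0])   # exactly the query's terms
long_doc = np.array([3, 3, 0, 9])    # repeats two query terms, plus an unrelated term

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("dot:", query.dot(short_doc), query.dot(long_doc))
print("cosine:", cosine(query, short_doc), cosine(query, long_doc))
```

The long document wins on the raw dot product purely through repetition, while cosine similarity ranks the exact match first.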
We show results for cosine similarity:
from sklearn.metrics.pairwise import cosine_similarity
score = cosine_similarity(X, q).flatten()
df.iloc[np.argsort(-score)[:10]]
 | documents |
---|---|
5437 | Pamoja na kuwa wanashirikiana katika mipaka yao na kuwepo kwao katika wigo la kiuchumi la pamoja, nchi za Afrika Mashariki zinakuwa na hisia kali baina yao zinazotokana na tofauti zao za kiuchumi na kisiasa. |
3022 | “Inasikitisha kwamba nchi kadhaa duniani na mashirika mbalimbali, ikiwemo Marekani, China na shirika la afya duniani, yanakabiliwa na tishio la maambukizi ya Corona. |
4808 | Amesema "mwanzoni tulizungumza kujifunza kutoka uzoefu wa nchi nyingine katika kupambana na janga la corona, sasa tunazungumzwa na nchi nyingine kama kielelezo hasi cha vita dhidi ya janga la corona Afrika na duniani. |
664 | Amesema "mwanzoni tulizungumza kujifunza kutoka uzoefu wa nchi nyingine katika kupambana na janga la corona, sasa tunazungumzwa na nchi nyingine kama kielelezo hasi cha vita dhidi ya janga la corona Afrika na duniani. |
6369 | Mkurugenzi Mkuu wa Shirika la Afya Duniani Tedros Adhanom Ghebreyesus, ameonya kufungwa kwa mipaka ya nchi na kusitisha shughuli zote ili kupambana na janga la COVID 19 kunaweza kusababisha kuongezeka kwa vifo kutokana na ugonjwa wa Malaria katika nchi za Afrika. |
4571 | Shahidi mmoja aliliambia shirika la habari la Reuters katika wiki kadhaa za karibuni za maandamano yanayoipinga serikali yaliyochochewa na malalamiko ya kiuchumi na kisiasa. |
6182 | Ndege za kijeshi za India, zilivuka mpaka na kuingia katika nchi jirani ya Pakisan na kutekeleza mashambulizi dhidi ya kambi iliyodaiwa kuwa ya kutoa mafunzi kwa kundi la wanamgambo la Jaish-e-Mohammad, lililoripotiwa kuhusika na shambulizi la bomu la Kashmir. |
6107 | Lissu ameongeza kwamba “Kwa kutumia njia hizo za amri za kisiasa, vyombo vya ulinzi na usalama vikiwemo jeshi la wananchi la Tanzania, jeshi la polisi, taasisi ya kuzuia na kupambana na rushwa, na idara ya usalama wa taifa pamoja na mamlaka ya kodi TRA, vinatumika kukamata kwa nguvu na kutaifisha mali za wafanyabiashara wetu wa ndani na makampuni ya wawekezaji kutoka nje. |
3191 | SAA, shirika kubwa la ndege la Afrika liliingia katika mpango wa kujilinda kutokana na hali ya kufilisika mwezi Disemba 2019, na tangu wakati huo lililazimika kusitisha safari zake zote za abiria kutokana na janga la virusi vya korona kote duniani. |
4993 | Shirika la habari la China Xinhua linasema kutakuwa na karibu safari 200 za ndege kuingia na kutoka Wuhan Jumatano. |
Cosine similarity results in seemingly more relevant results.
We create a generic function for getting search results from the vector space:
def search(X, query, num_results=10):
    """Return the rows of df most similar to the query, best match first."""
    score = cosine_similarity(X, query).flatten()
    idx = np.argsort(-score)[:num_results]
    return df.iloc[idx]
TF-IDF Vectorization
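The function above reads `df` from the notebook's global scope and returns only the matching rows. As a hedged alternative, the sketch below returns the indices together with their scores and guards against all-zero vectors. `rank_by_cosine` is a hypothetical helper, it uses plain NumPy rather than scikit-learn, and it assumes dense arrays (a sparse matrix would need converting first).

```python
import numpy as np

def rank_by_cosine(doc_matrix, query_vector, num_results=10):
    """Return (indices, scores) of the top documents by cosine similarity."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float).ravel()
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vector)
    # Rows (or a query) with no known terms get a score of 0 instead of NaN.
    scores = np.divide(doc_matrix @ query_vector, denom,
                       out=np.zeros(len(doc_matrix)), where=denom > 0)
    idx = np.argsort(-scores, kind="stable")[:num_results]
    return idx, scores[idx]

# Hypothetical toy matrix: three documents over a three-term vocabulary.
toy_matrix = [[1, 1, 0], [0, 0, 0], [2, 0, 1]]
idx, scores = rank_by_cosine(toy_matrix, [1, 1, 0], num_results=2)
print(idx, scores)
```

Returning the scores makes it easier to spot weak matches, for example when even the top result has a similarity close to zero.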
Words such as na, kwa, la and ya occur throughout the entire corpus but contribute little to the relevance of a particular document to the query.
TF-IDF, term frequency-inverse document frequency in full, minimises the effect of these words. It introduces a new score in place of counts in the document-term matrix that shows how important a term is to the document.
The score is calculated by multiplying two quantities: the term frequency, which measures how often the term occurs in the document, and the inverse document frequency, which measures how rare the term is across all the documents in the corpus.
To do this, we replace the CountVectorizer with a TfidfVectorizer:
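The calculation can be traced by hand on a toy corpus. The sketch below assumes the smoothed variant that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by scaling each document vector to unit length; other TF-IDF variants give different numbers. The corpus and the helper names `idf` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (not from the dataset); "la" occurs in every document.
toy_docs = [
    "janga la corona",
    "shirika la afya",
    "bunge la taifa",
]
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: rarer terms get larger values."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Raw term count times idf, scaled so the vector has unit length."""
    weights = {term: count * idf(term) for term, count in Counter(tokens).items()}
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / length for term, w in weights.items()}

first = tfidf(tokenized[0])
print({term: round(weight, 3) for term, weight in first.items()})
```

Because "la" appears in every toy document, it receives the lowest weight in the first document, while "janga" and "corona" share the highest.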
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer()
X_tfv = tfv.fit_transform(df.documents)
q_tfv = tfv.transform([query])
names = tfv.get_feature_names_out()
df_docs = pd.DataFrame(X_tfv.toarray(), columns=names)
first_doc = df_docs.loc[0]
first_doc[first_doc != 0]
14            0.311283
19            0.260709
afya          0.234890
covid         0.273393
imeripoti     0.327010
jumatatu      0.253012
kuwa          0.143414
maambukizi    0.222536
takriban      0.275575
tanzania      0.210560
wamepata      0.413453
watu          0.161859
wizara        0.255201
ya            0.216513
zaidi         0.186419
Name: 0, dtype: float64
We no longer have counts, but the tf-idf score.
We can now run a search, which shows improved results:
search(X_tfv, q_tfv)
 | documents |
---|---|
664 | Amesema "mwanzoni tulizungumza kujifunza kutoka uzoefu wa nchi nyingine katika kupambana na janga la corona, sasa tunazungumzwa na nchi nyingine kama kielelezo hasi cha vita dhidi ya janga la corona Afrika na duniani. |
4808 | Amesema "mwanzoni tulizungumza kujifunza kutoka uzoefu wa nchi nyingine katika kupambana na janga la corona, sasa tunazungumzwa na nchi nyingine kama kielelezo hasi cha vita dhidi ya janga la corona Afrika na duniani. |
2574 | Mali, nchi yenye huduma mbaya za afya, iliandaa uchaguzi jumapili, licha ya janga la virusi vya Corona. |
6369 | Mkurugenzi Mkuu wa Shirika la Afya Duniani Tedros Adhanom Ghebreyesus, ameonya kufungwa kwa mipaka ya nchi na kusitisha shughuli zote ili kupambana na janga la COVID 19 kunaweza kusababisha kuongezeka kwa vifo kutokana na ugonjwa wa Malaria katika nchi za Afrika. |
7116 | China imesema Jumatano kuwa uamuzi wa Rais wa Marekani Donald Trump kusitisha ufadhili kwa Shirika la Afya Duniani kutaziathiri nchi zote wakati dunia ikikabiliwa na hatua muhimu ya kupambana na janga la virusi vya corona. |
2311 | Hatua hiyo imechukuliwa huku maambukizi yakiendelea kuongezeka kote duniani, na baada ya Shirika la Afya Duniani (WHO) kutangaza maambukizi ya Corona kuwa janga la kimataifa. |
3022 | “Inasikitisha kwamba nchi kadhaa duniani na mashirika mbalimbali, ikiwemo Marekani, China na shirika la afya duniani, yanakabiliwa na tishio la maambukizi ya Corona. |
814 | Katika ukosoaji wa nadra kwa umma, shirika la afya Duniani-WHO wiki iliyopita lilieleza kwamba katika kupingana na kanuni za kimataifa za afya, Tanzania ilikataa kutoa taarifa za kina za kesi zinazoshukiwa kuwa za Ebola. |
911 | Wakati huo huo, Museveni ametangaza mipango ya kuwarudisha Uganda raia wa nchi hiyo ambao wamekwama nchi za nje kutokana na janga la Corona. |
1020 | Kwa mujibu wa shirika la habari la Uingereza Reuters hisa za shirika la ndege la Ujerumani Luftansa zilipanda kwa asilimia 6. |
Embeddings
A remaining problem is that we're still matching exact terms in the documents (lexical search), so synonyms and closely related terms won't be captured by the search.
To fix this, we introduce embeddings, a technique that places related words close together in a vector space, so that the representation captures concepts and contextual information.
Singular Value Decomposition (SVD)
This is a technique in linear algebra that operates on a matrix to extract its most important features. It can be used, for example, in lossy compression of images where important features of the image are extracted, and can be used to recreate the original image but with a lower resolution.
When used on a document-term matrix, it extracts associations between related words that together represent a concept or topic.
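The idea can be sketched on a tiny dense matrix with NumPy's exact SVD. This is an illustration under assumptions, not the notebook's method: `X_toy` is a hypothetical 4-document by 5-term matrix, `k` is an arbitrary choice, and scikit-learn's TruncatedSVD computes an approximation suited to large sparse input rather than the full decomposition shown here.

```python
import numpy as np

# Hypothetical 4-document x 5-term matrix of weights.
X_toy = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

# X = U @ diag(S) @ Vt, with singular values sorted from largest to smallest.
U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)

k = 2  # number of components (topics) to keep
doc_embeddings = U[:, :k] * S[:k]   # each document described by k numbers
projected = X_toy @ Vt[:k].T        # the same embeddings, obtained by projection

# A new vector over the same terms, such as a query, is projected the same way.
query_toy = np.array([[1.0, 1.0, 1.0, 0.0, 0.0]])
query_embedding = query_toy @ Vt[:k].T

print(doc_embeddings.shape, query_embedding.shape)
```

Each row of `Vt[:k]` mixes several terms into one direction, which is why documents that use different but co-occurring words can end up close together in the reduced space.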
We use the vector representation from the TF-IDF vectorizer to create SVD embeddings:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=500)
X_svd = svd.fit_transform(X_tfv)
X_svd.shape
(7672, 500)
We set the number of components (topics) to extract to 500. For each document, the matrix shows how strongly it is associated with each of the 500 topics.
We can see how the first document ranks for the first 5 topics:
X_svd[0,:5]
array([ 0.18967023, -0.06791604, -0.24394782, -0.04354471, 0.00576048])
The n_components parameter is dependent on the dataset used and some experimentation may be required to arrive at the optimal value.
We similarly create embeddings for the query and run a search:
q_svd = svd.transform(q_tfv)
q_svd.shape
(1, 500)
search(X_svd, q_svd)
 | documents |
---|---|
664 | Amesema "mwanzoni tulizungumza kujifunza kutoka uzoefu wa nchi nyingine katika kupambana na janga la corona, sasa tunazungumzwa na nchi nyingine kama kielelezo hasi cha vita dhidi ya janga la corona Afrika na duniani. |
4808 | Amesema "mwanzoni tulizungumza kujifunza kutoka uzoefu wa nchi nyingine katika kupambana na janga la corona, sasa tunazungumzwa na nchi nyingine kama kielelezo hasi cha vita dhidi ya janga la corona Afrika na duniani. |
2574 | Mali, nchi yenye huduma mbaya za afya, iliandaa uchaguzi jumapili, licha ya janga la virusi vya Corona. |
6369 | Mkurugenzi Mkuu wa Shirika la Afya Duniani Tedros Adhanom Ghebreyesus, ameonya kufungwa kwa mipaka ya nchi na kusitisha shughuli zote ili kupambana na janga la COVID 19 kunaweza kusababisha kuongezeka kwa vifo kutokana na ugonjwa wa Malaria katika nchi za Afrika. |
968 | Akizungumza na shirika la habari la AFP Akram Taher mumini moja amesema “Sherehe za Eid hazifani manmo wakati huu wa hali ya janga la corona – watu wanahisia ya kua na hofu” Bara la Asia Waislam katika bara la Asia – kutoka Indonesia hadi Pakistan, Malaysia na Afghanistan – wamekusanyika katika masoko katika shamrashamra za manunuzi ya sikukuu, wakikiuka muongozo wa kudhibiti virusi vya corona na wakati mwengine polisi wakijaribu kutawanya mikusanyiko ya makundi makubwa. |
814 | Katika ukosoaji wa nadra kwa umma, shirika la afya Duniani-WHO wiki iliyopita lilieleza kwamba katika kupingana na kanuni za kimataifa za afya, Tanzania ilikataa kutoa taarifa za kina za kesi zinazoshukiwa kuwa za Ebola. |
3022 | “Inasikitisha kwamba nchi kadhaa duniani na mashirika mbalimbali, ikiwemo Marekani, China na shirika la afya duniani, yanakabiliwa na tishio la maambukizi ya Corona. |
7116 | China imesema Jumatano kuwa uamuzi wa Rais wa Marekani Donald Trump kusitisha ufadhili kwa Shirika la Afya Duniani kutaziathiri nchi zote wakati dunia ikikabiliwa na hatua muhimu ya kupambana na janga la virusi vya corona. |
1505 | Nchi hizo za Afrika magharibi zimekuwa zikirekodi ongezeko la watu wanaoambukizwwa virusi vya Corona, na haijulikani namna zitakavyokabiliana na maambukizi hayo. |
4592 | " Na katika ujumbe wake siku ya Jumapili, Katibu Mkuu wa Umoja wa Mataifa Antonio Gutteres, amesema wakati janga la COVID 19 likiendelea, limetoa pia fursa kwa janga la pilli nalo ni habari za upotoshaji kutokana na ushauri wa hatari wa afya unaotokana na nadharia za dhana zisizo na msingi. |
BERT
The previous methods all used the bag-of-words approach; the order of words wasn't taken into consideration when searching. The order of the words in the documents may add useful contextual information that could improve the search.
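A quick check makes the limitation visible. The two sentences below are hypothetical examples that swap subject and object; under a bag-of-words representation they are indistinguishable.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
sentence_a = "Trump amemkosoa Biden"
sentence_b = "Biden amemkosoa Trump"

# A bag of words keeps only how often each word occurs, not where.
bag_a = Counter(sentence_a.lower().split())
bag_b = Counter(sentence_b.lower().split())

print(sentence_a == sentence_b, bag_a == bag_b)
```

The sentences differ as strings, yet their word counts are identical, so any count-based or TF-IDF vector built from them would be the same.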
BERT is a deep neural network model based on the transformer architecture. Having been pre-trained on a large corpus of text, it encodes the contextual meaning of words, taking into account where they occur in a sentence.
We'll use a variant of BERT called flax-community/bert-swahili-news-classification, which has been fine-tuned on Swahili news text and will be downloaded from Hugging Face.
Each pretrained transformer model has a tokenizer that is used to encode the input text into the vocabulary that was used in the training. We instantiate both the tokenizer and the model:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("flax-community/bert-swahili-news-classification")
model = AutoModelForSequenceClassification.from_pretrained("flax-community/bert-swahili-news-classification")
Using two documents from our text corpus, we run through the process of creating BERT embeddings:
texts = df.documents[:2].tolist()
encoded_text = tokenizer(texts, padding=True, return_tensors='pt')
We can see the encoded documents as the input_ids attribute:
encoded_text.input_ids.shape, encoded_text.input_ids
(torch.Size([2, 21]), tensor([[ 1057, 117, 902, 117, 367, 362, 2867, 3731, 200, 20, 283, 5271, 869, 349, 8119, 4252, 117, 7632, 21, 588, 22], [15117, 7672, 587, 156, 1938, 115, 367, 20, 870, 1374, 544, 21, 595, 21, 662, 119, 628, 969, 1617, 22, 0]]))
The document embeddings will be found at the last hidden layer before the output layer, after doing a forward pass through the model:
import torch
model.config.output_hidden_states=True
with torch.no_grad():  # Disable gradient calculation since we aren't training
    outputs = model(**encoded_text)
last_hidden_states = outputs.hidden_states[-1]
last_hidden_states.shape
torch.Size([2, 21, 768])
compressed_emb = last_hidden_states.mean(dim=1)
compressed_emb.shape, compressed_emb.numpy()
(torch.Size([2, 768]), array([[ 1.4280016 , -0.18435301, -0.03840554, ..., -1.2254812 , -0.68361926, 0.7766203 ], [ 0.8678242 , -0.80565315, -0.9088004 , ..., -0.47340122, 0.38405433, 0.7781875 ]], dtype=float32))
At this point, each input document is represented by a 768-dimensional embedding vector.
To replicate the process for the entire set of documents, we'll do inference in batches so as not to overwhelm the hardware. Modify batch_size as appropriate for your hardware.
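Note that averaging over every token position also averages in the padding positions of shorter documents; the second document above appears to end in one padding token. A common alternative is to weight the average by the attention mask that Hugging Face tokenizers return alongside input_ids. The sketch below shows the arithmetic on small hypothetical NumPy arrays rather than real model output, and it is a variation on the approach used here, not the notebook's method.

```python
import numpy as np

# Hypothetical hidden states: 2 documents, 3 token positions, 2 dimensions.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],   # last position is padding
])
# 1 marks a real token, 0 marks padding (the role of the tokenizer's attention mask).
attention_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

# Plain mean over all positions, padding included.
plain_mean = hidden.mean(axis=1)

# Masked mean: zero out padded positions, then divide by the number of real tokens.
mask = attention_mask[:, :, None]   # shape (2, 3, 1) so it broadcasts over dimensions
masked_mean = (hidden * mask).sum(axis=1) / mask.sum(axis=1)

print(plain_mean)
print(masked_mean)
```

The two averages agree for the unpadded document and differ for the padded one, which is the effect masking is meant to remove. Whether it changes the search ranking on this dataset would need to be tested.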
from tqdm import tqdm
def get_embeddings(documents, batch_size=50):
    """Embed documents in batches by mean-pooling the last hidden layer."""
    embedding_batches = []
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i+batch_size]
        encoded_text = tokenizer(batch, padding=True, return_tensors='pt')
        with torch.no_grad():  # no gradients needed for inference
            outputs = model(**encoded_text)
        last_hidden_states = outputs.hidden_states[-1]
        embedding_batches.append(last_hidden_states.mean(dim=1).numpy())
    return np.vstack(embedding_batches)
X_bert = get_embeddings(df.documents.to_list())
100%|██████████████████████████████████████| 154/154 [20:40<00:00, 8.06s/it]
We also convert the query into embeddings:
q_bert = get_embeddings([query], batch_size=1)
100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 5.60it/s]
X_bert.shape, q_bert.shape
((7672, 768), (1, 768))
We then run a search using BERT embeddings:
search(X_bert, q_bert)
 | documents |
---|---|
5679 | Nchi tatu za Africa zinamaambukizi zaidi katika mlipuko wa virusi vya corona bara la Afrika, lakini naibu mkurugenzi wa Vituo vya Kudhibiti Magonjwa na Kinga Afrika wanasema bara lote lazima lichukue hatua. |
3020 | “Umoja wa Afrika unaunga kwa dhati juhudi za WHO katika kupambana na virusi vya Corona na kutaka viongozi wote wa dunia, kuungana katika kuzuia maambukizi na vifo kutokana na virusi hivo” ameandika Mahamat kwenye mtandao huo. |
2460 | Shirika la Afya Duniani (WHO) Ijumaa limeonya kuwa watu 190,000 wanaweza kupoteza maisha mwaka 2020 barani Africa, iwapo serikali zitashindwa kudhibiti maambukizi ya virus vya corona. |
3022 | “Inasikitisha kwamba nchi kadhaa duniani na mashirika mbalimbali, ikiwemo Marekani, China na shirika la afya duniani, yanakabiliwa na tishio la maambukizi ya Corona. |
6369 | Mkurugenzi Mkuu wa Shirika la Afya Duniani Tedros Adhanom Ghebreyesus, ameonya kufungwa kwa mipaka ya nchi na kusitisha shughuli zote ili kupambana na janga la COVID 19 kunaweza kusababisha kuongezeka kwa vifo kutokana na ugonjwa wa Malaria katika nchi za Afrika. |
2536 | Shirika la Afya Duniani, WHO, limeripoti kuchukuwa hatua za haraka kukabiliana na mlipuko wa virusi vya Ebola nchini Uganda, huku hatua za madhubuti zikichukuliwa kuhakikisha kwamba watu wanaoishi kwenye mpaka wa Uganda na Jamhuri ya Kidemokrasia ya Congo wanaanza kupokea chanjo dhidi ya Ebola Hadi sasa, visa vitatu vya maambukizi vimethibitishwa ikiwemo kifo kimoja. |
3016 | Baadhi ya viongozi barani Afrika wameeleza kushangazwa kwao na tamko la rais wa Marekani kwamba anafikiria kusitisha msaada wa kifedha kwa Shirika la Afya Duniani WHO. |
5145 | Wakati timu za wanaokabiliana na Ebola wakihangaika kudhibiti mlipuko wa ugonjwa huo huko Jamhuri ya Kidemokrasia ya Congo, DRC, mawaziri wa afya wa Jumuiya ya Afrika Mashariki (EAC) wanasema wanatambua hatari zake na wanatathmini kuanza kutumia chanjo ya Ebola iliyoko katika majaribio. |
6932 | Mkuu wa Shirika la Afya, WHO, nchini Burundi alifukuzwa wiki iliyopita baada ya kueleza wasiwasi wake juu ya uelewa wa serikali unavyokinzana na hatari inayoletwa na virusi vya corona. |
1481 | Wizara ya afya a Uganda imesema kwamba shirika la afya duniani limetoa mwelekeo kwamba kila kisa cha maambukizi kinahesabiwa katika nchi kimeripotiwa. |
The BERT results look like the most relevant so far.
Conclusion
We looked at various techniques for searching textual documents, from a simple keyword-based approach to a more sophisticated one that uses embeddings from language models like BERT. These are a subset of the techniques applied in the wider field of Information Retrieval. A good resource for more in-depth study can be found on the Stanford NLP website.