Wikifunctions:Status updates/2024-12-12/de: Difference between revisions

Revision as of 21:51, 15 December 2024


Sketching a path to Abstract Wikipedia

The main goal of Wikifunctions is to support Abstract Wikipedia: a source of multilingual Wikipedia content, where we only have to create and maintain the content once, but have it available in many different languages, in order to fill some of the gaps that currently exist in some Wikipedias.

Today, I want to sketch out what natural language generation for Abstract Wikipedia could look like. As an example target, let us take the following sentence (based on the English Wikipedia article on Waakye):

English
"Waakye is a Ghanaian dish of cooked rice and beans."
French
"Le waakye est un mets ghanéen de riz et de haricots cuits."
German
"Waakye ist ein ghanaisches Gericht aus gekochtem Reis und Bohnen."

We consider four stages to get to this text.

Stage 1: String-based substitution

In Stage 1, we use simple string substitution, in the style of Mad Libs. This approach requires the user to carefully select the right strings, which is quite simple in English, but gets more complicated in French or German.

So we could have the following function calls:

Instance with origin string-based English("Waakye", "dish", "Ghanaian")

→ "Waakye is a Ghanaian dish."

Instance with origin string-based French("Le waakye", "un mets", "ghanéen")

→ "Le waakye est un mets ghanéen."

Instance with origin string-based German("Waakye", "ein Gericht", "ghanaisches")

→ "Waakye ist ein ghanaisches Gericht."

This is possible right now. It requires quite detailed grammatical knowledge by the function caller, as they need to enter the right form manually. The benefit of this method is difficult to see in this example.
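The three calls above can be sketched in Python as plain string templates. This is a minimal illustration, not the actual Wikifunctions implementations: the function names and the way the German version splits "ein Gericht" into article and noun are assumptions made here to reproduce the outputs shown.

```python
def instance_with_origin_english(subject: str, kind: str, origin: str) -> str:
    # The article "a" is hard-coded, so a noun needing "an" would come out wrong.
    return f"{subject} is a {origin} {kind}."


def instance_with_origin_french(subject: str, kind: str, origin: str) -> str:
    # The caller supplies the articles as part of the strings ("Le waakye", "un mets").
    return f"{subject} est {kind} {origin}."


def instance_with_origin_german(subject: str, kind: str, origin: str) -> str:
    # German places the adjective between article and noun,
    # so "ein Gericht" is split once at the first space (assumed input shape).
    article, noun = kind.split(" ", 1)
    return f"{subject} ist {article} {origin} {noun}."


print(instance_with_origin_german("Waakye", "ein Gericht", "ghanaisches"))
```

Every agreement decision (article, adjective ending) sits with the caller here, which is the burden the text describes.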

Stage 2: Lexeme-based generation

In Stage 2, instead of using strings, we use Wikidata Lexemes, which has become possible in the past few months. This allows for a version of the function where the function caller does not have to worry about agreement and entering the right form manually, but the function implementer needs to select the right form from the Lexeme instead. This shifts some of the burden from the function user to the function author.

This makes the calling much simpler: we don’t have to know whether "waakye" in French will be "Le waakye" or "La waakye", we don’t have to select the agreeing adjective in German ("ghanaisches Gericht" or "ghanaischer Gericht"), etc. The correct form will be chosen by the Function.

Now we would have the following function calls:

Instance with origin Lexeme-based English(Lxxx/Waakye, L3964/dish, Lxxx/Ghanaian)

→ "Waakye is a Ghanaian dish."

Zxxx/Instance with origin Lexeme-based French(Lxxx/waakye, L24812/mets, Lxxx/ghanéen)

→ "Le waakye est un mets ghanéen."

Zxxx/Instance with origin Lexeme-based German(Lxxx/Waakye, L500931/Gericht, Lxxx/ghanaisch)

→ "Waakye ist ein ghanaisches Gericht."

You will also find that a lot of Lexemes are missing for this particular example, such as the French Lexeme for something from Ghana. We in the Wikimedia movement need to think about how to approach this gap between what is currently in Wikidata's Lexemes and what should be there.

We were hoping that this would be possible right now, and we created a number of functions during our offsite to test these capabilities. Unfortunately, we learned that the system is currently failing to evaluate most such function calls, and accordingly we decided to put a big focus in the upcoming Quarter on getting these functions to run.
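To illustrate what "selecting the right form from the Lexeme" could look like for the German case, here is a minimal sketch. The `NounLexeme` and `AdjectiveLexeme` classes and their hand-entered forms are hypothetical stand-ins invented for this example; they are not the Wikidata Lexeme data model or any Wikifunctions API.

```python
from dataclasses import dataclass


@dataclass
class NounLexeme:
    lemma: str
    gender: str  # "masculine", "feminine" or "neuter"


@dataclass
class AdjectiveLexeme:
    lemma: str
    # Nominative singular forms after an indefinite article, keyed by gender.
    forms: dict


# Indefinite article in the nominative singular, keyed by noun gender.
INDEFINITE_ARTICLE = {"masculine": "ein", "feminine": "eine", "neuter": "ein"}


def instance_with_origin_german(subject: str, kind: NounLexeme, origin: AdjectiveLexeme) -> str:
    # The function, not the caller, picks the article and the agreeing adjective form.
    article = INDEFINITE_ARTICLE[kind.gender]
    adjective = origin.forms[kind.gender]
    return f"{subject} ist {article} {adjective} {kind.lemma}."


ghanaisch = AdjectiveLexeme(
    lemma="ghanaisch",
    forms={"masculine": "ghanaischer", "feminine": "ghanaische", "neuter": "ghanaisches"},
)
gericht = NounLexeme(lemma="Gericht", gender="neuter")
print(instance_with_origin_german("Waakye", gericht, ghanaisch))
```

The caller now passes only the lexemes; swapping in a feminine noun such as "Speise" changes both the article and the adjective ending without any change at the call site.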

Stage 3: Item-based generation

In the third stage, we would use Wikidata items to help us select Lexemes from a given language that have comparable meanings. The function caller does not have to know or look up the right Lexeme in all the languages they want to generate the text in. They can just put in the relevant Wikidata items, and the function developer can implement the relevant lookups.

This means that whether or not the function caller knows that the concept "dish" is called "mets" in French or "Gericht" in German, they will still be able to create perfectly fluid and correct sentences in those languages.

This allows us to make the following calls (note that all three calls use the same function here, and the caller does not have to know the languages at all):

Instance with origin(Q14783691/Waakye, Q746549/dish, Q117/Ghana, Z1002/English)

→ "Waakye is a Ghanaian dish."

Instance with origin(Q14783691/Waakye, Q746549/dish, Q117/Ghana, Z1004/French)

→ "Le waakye est un mets ghanéen."

Instance with origin(Q14783691/Waakye, Q746549/dish, Q117/Ghana, Z1430/German)

→ "Waakye ist ein ghanaisches Gericht."

Note that the function will in most cases just route to the language-specific functions developed for the previous stage, but that happens behind the scenes and transparently for the function caller.

This is currently not possible to implement on Wikifunctions — we still need to add a function that allows us to find the Lexemes connected to a given Item. We will work on that in the coming Quarter, and are thankful to the Search and Wikidata teams for the necessary pre-work they have recently performed to unlock the possibility.
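The routing described in this stage can be sketched as a lookup followed by a dispatch to a language-specific renderer. Everything below is a mock built for illustration: the `LEXICON` table stands in for the not-yet-available "Lexemes connected to an Item" lookup, it stores ready-made strings rather than real Lexemes, and languages are keyed by name rather than by their Wikifunctions identifiers.

```python
# Hypothetical item-to-wording table; a real implementation would query Wikidata Lexemes.
LEXICON = {
    "English": {"Q14783691": "Waakye", "Q746549": "dish", "Q117": "Ghanaian"},
    "French": {"Q14783691": "Le waakye", "Q746549": "un mets", "Q117": "ghanéen"},
    "German": {"Q14783691": "Waakye", "Q746549": "ein Gericht", "Q117": "ghanaisches"},
}


def render_english(subject: str, kind: str, origin: str) -> str:
    return f"{subject} is a {origin} {kind}."


def render_french(subject: str, kind: str, origin: str) -> str:
    return f"{subject} est {kind} {origin}."


def render_german(subject: str, kind: str, origin: str) -> str:
    article, noun = kind.split(" ", 1)
    return f"{subject} ist {article} {origin} {noun}."


RENDERERS = {"English": render_english, "French": render_french, "German": render_german}


def instance_with_origin(subject_qid: str, kind_qid: str, origin_qid: str, language: str) -> str:
    # Look up the wording for each item, then route to the language-specific function.
    words = LEXICON[language]
    render = RENDERERS[language]
    return render(words[subject_qid], words[kind_qid], words[origin_qid])


for language in ("English", "French", "German"):
    print(instance_with_origin("Q14783691", "Q746549", "Q117", language))
```

The call site is identical for all three languages; only the last argument changes, which is the property this stage is after.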

Stage 4: Item-based content

The final stage we want to discuss today is based on using the knowledge in Wikidata to create text. We can pull from Wikidata that Q14783691/Waakye is a dish from Q117/Ghana, we can look up the ingredients and their Lexemes, etc. Given the current knowledge about Waakye in Wikidata, this could then generate the following sentences:

Food with origin and ingredients(Q14783691/Waakye, Z1002/English)

→ "Waakye is a Ghanaian dish with bean, rice, water, and salt."

Food with origin and ingredients(Q14783691/Waakye, Z1004/French)

→ "Le waakye est un plat ghanéen composé de haricots, de riz, d'eau et de sel."

Food with origin and ingredients(Q14783691/Waakye, Z1430/German)

→ "Waakye ist ein ghanaisches Gericht aus Bohnen, Reis, Wasser und Salz."

This further simplifies writing the function calls: all we need to select is the dish and the language, and we get a whole sentence that can, in many cases, make a good opening sentence for the Wikipedia article about the given dish, or as an entry or short description in various places.
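One small building block such a function would need is joining the ingredient list in a way that fits each language. The sketch below is an assumption about how that step might be written; the `join_list` helper and its `serial_comma` flag are hypothetical names, and the ingredient strings are passed in already inflected.

```python
def join_list(items: list[str], conjunction: str, serial_comma: bool = False) -> str:
    # Join items with commas and put the conjunction before the last item.
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} {conjunction} {items[1]}"
    head = ", ".join(items[:-1])
    # English as used above puts a comma before "and"; French and German do not.
    separator = ", " if serial_comma else " "
    return f"{head}{separator}{conjunction} {items[-1]}"


english = join_list(["bean", "rice", "water", "salt"], "and", serial_comma=True)
german = join_list(["Bohnen", "Reis", "Wasser", "Salz"], "und")
print(f"Waakye is a Ghanaian dish with {english}.")
print(f"Waakye ist ein ghanaisches Gericht aus {german}.")
```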

I hope that this gives a good overview of our next few planned steps with regard to natural language generation and how Wikifunctions can support bringing together our different language communities.

Team offsite in Lisbon

Abstract Wikipedia team at the offsite in Lisbon 2024. From left to right, front row: Cory Massaro, Grace Choi, Genoveva Galarza Heredero, Daphne Smit. Back row: James Forrester, Denny Vrandečić, David Martin, Sharvani Haran. Not in picture: Amy Tsay, Amin Al Hazwani, Luca Martinelli, Elena Tonkovidova, Vaughn Walters.

Last week, the team met for its annual meeting in Lisbon, Portugal. What a beautiful city! We enjoyed walking through the city, and had very productive meetings, discussing our plans, team procedures, and using the time for bonding and social cohesion – very difficult and important to achieve in a team that is fully remote.

The most tangible outcome is the planning for the next Quarter; we had very lively discussions to find a consensus, which we still need to write up. We will report on the plan in one of the next two updates.

New tool for querying Wikifunctions

Feeglgeef created a new tool that allows you to query Wikifunctions in a very flexible way. You can search for functions with implementations in Python, Types that use numbers as keys, functions that take three arguments, or return booleans. The tool is available on Replit (note that this is outside of Wikimedia servers), and examples and documentation of the query language are linked from the front page of the tool: wf-query.replit.app

Hogü-456 created an overview of existing tools. If you are aware of more tools, feel free to add them: Wikifunctions:Tools

Recent Changes in the software

There's no release of MediaWiki software this week due to the Release Engineering team's ordered release freeze, so nothing new to update. As always, please alert us if you run into any issues.

News in Types: Gregorian calendar date, Byte, Unicode code point

We finally have a Type for Gregorian calendar dates. We have been working a while towards it, having created a Type for the relevant months, for years, etc. The discussion was lengthy and didn’t lead to a full consensus. A rationale for the decisions on the design of the Type is provided. We invite you to create functions using the Type!

This has been by far the most complex Type we are providing so far.

We would like to create Types for other, non-Gregorian calendars, like the Chinese, Ethiopian, Japanese, Hebrew, and other calendars. If you know any of these calendars well, please reach out so that we can create the respective calendars.

In other Type-related work, proposals for fixing the Byte Type and the Unicode code point Type (previously character Type) have been made. Input and discussion are very welcome.

Recordings of December’s Volunteers’ Corner

December 2024 Volunteers' Corner

We had a Volunteers’ Corner this Monday, December 9. It was lively with many good questions. A recording of the Corner is available on Commons.

The function we built together is featured below as the Function of the Week.

Recording of Denny’s SWIB24 keynote

Denny Vrandečić gave a keynote address at the Semantic Web in Libraries 2024 conference. The topic was the role of knowledge representations in a world of large language models. The recording is available on YouTube.

Function of the Week: how many days between two days in the Roman year

The last newsletter introduced the days of the Roman year as a new Type. As of now, we have 18 new functions using the Type. Also, this week’s Volunteers’ Corner created such a function, so we will take a look at the resulting function.

How many days are there between two days? Function Z20733 can answer that question. The function has three arguments: the two days, and a Boolean which tells us whether the days are in a leap year or not. It returns a natural number stating how many days are between the two given days.

It might be easiest to clarify what the function does by looking at the tests:

The tests are incomplete. The most notable omission is any test where the first day is after the second, and what exactly that means with regard to the leap year.

So far, there is only one implementation for this function, partly because we didn’t have much time left in the Volunteers’ Corner. We wrote only one, as a composition, because we found that to be the easiest way to implement the function.

The core of the composition is to turn both days into a number, counting which day of the year it is (i.e. 1 January is the first day, 2 January the second, 1 February the 32nd, etc.), and then subtract the first number from the second. The result is then turned from an integer to a natural number, in order to avoid negative numbers.
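The composition described above can be sketched in Python as follows. This is not the Wikifunctions implementation: representing a day as a `(month, day)` tuple is an assumption, and so is the use of the absolute value to turn the integer difference into a natural number, since the text does not say how negative differences are handled.

```python
# Month lengths in a non-leap year, January through December.
MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def day_of_year(month: int, day: int, leap_year: bool) -> int:
    # Count which day of the year this is: 1 January is 1, 1 February is 32, etc.
    number = sum(MONTH_LENGTHS[: month - 1]) + day
    if leap_year and month > 2:
        number += 1  # account for 29 February
    return number


def days_between(first: tuple, second: tuple, leap_year: bool) -> int:
    # Subtract the first day number from the second.
    difference = day_of_year(*second, leap_year) - day_of_year(*first, leap_year)
    # Assumption: the absolute value is used to avoid a negative result.
    return abs(difference)


print(days_between((2, 28), (3, 1), leap_year=True))
```

If the intended meaning of "first day after second" is instead to wrap around into the following year, the last step would need to change, which is exactly the case the missing tests would pin down.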