User:The Anome/SQL

From Wikipedia, the free encyclopedia

This page exists to save useful SQL queries, just in case the Quarry project ever goes away. Where this differs from the Quarry query, the Quarry version should be regarded as authoritative.

See also https://sql-optimizer.toolforge.org/ for a useful SQL analysis tool.

Articles not assigned to WikiProjects[edit]

https://quarry.wmcloud.org/query/72412

SELECT
  article.page_title,
  article.page_id,
  talk.page_id AS talk_id
FROM
  page AS article
  LEFT JOIN page_props ON pp_page = article.page_id  AND pp_propname = 'disambiguation'
  LEFT JOIN page AS talk ON talk.page_title = article.page_title AND talk.page_namespace = 1 AND talk.page_is_redirect = 0
WHERE
  article.page_namespace = 0
  AND pp_page IS NULL
  AND article.page_is_redirect = 0
  AND NOT (article.page_title LIKE "List_of_%")
  AND NOT (article.page_title LIKE "Lists_of_%")
  AND NOT EXISTS (
    SELECT
      1
    FROM
      categorylinks
    WHERE
      article.page_id = cl_from
      AND (
        cl_to IN (
          "Temporary_maintenance_holdings",
          "Candidates_for_speedy_deletion",
          "Requested_RD1_redactions"
        )
        OR cl_to LIKE "%_names"
        OR cl_to LIKE "%_surnames"
        OR cl_to LIKE "Lists_of_%"
      )
  )
  AND NOT EXISTS (
    SELECT
      1
    FROM
      templatelinks
      JOIN linktarget ON lt_id = tl_target_id
    WHERE
      tl_from = article.page_id
      AND lt_namespace = 10
      AND (
        lt_title IN (
          "Wi",
          "Surname",
          "Given_name",
          "Animal_common_name",
          "Plant_common_name",
          "Nickname",
          "Dmbox"
        )
        OR lt_title LIKE "%index%"
        OR lt_title LIKE "%redirect%"
      )
  )
  AND (
    (talk.page_id IS NULL)
    OR (
      (talk.page_id IS NOT NULL)
      AND NOT EXISTS (
        SELECT
          1
        FROM
          categorylinks
        WHERE
          talk.page_id = cl_from
          AND (
            cl_to LIKE "%WikiProject_%"
            OR cl_to LIKE "%-Class_%_articles"
            OR cl_to LIKE "%-importance_%_articles"
            OR cl_to LIKE "%-priority_%_articles"
            OR cl_to LIKE "Unassessed_%_articles"
          )
      )
      AND NOT EXISTS (
        SELECT
          1
        FROM
          templatelinks
          JOIN linktarget ON lt_id = tl_target_id
        WHERE
          tl_from = talk.page_id
          AND lt_namespace = 10
          AND (lt_title LIKE "%WikiProject_%")
      )
    )
  )
ORDER BY
  article.page_title

Biographical articles without WP Biography tag[edit]

https://quarry.wmcloud.org/query/72401

SELECT 
  CONCAT("* [[",article.page_title, "]]") -- , talk.page_id AS talk_page_id
FROM
  page AS article
  LEFT JOIN page AS talk ON talk.page_title = article.page_title AND talk.page_namespace = 1  AND talk.page_is_redirect = 0
WHERE
 -- article.page_id % 10 = 2
  article.page_namespace = 0
  AND article.page_is_redirect = 0
  AND NOT article.page_title RLIKE "^(Lists?_of|[0-9]{4})_.*$" -- no dated events or lists
  AND EXISTS (SELECT 1 FROM categorylinks WHERE cl_from = article.page_id and 
              (cl_to = "Living_people"
               OR cl_to RLIKE "^(Date|Year)_of_(birth|death)_(missing|unknown).*$"
               OR cl_to RLIKE "^[0-9][^_]+_(births|deaths)$")) -- make sure we don't catch things like 'protest-related deaths'
  AND (
    (talk.page_id IS NULL) 
    OR ((talk.page_id IS NOT NULL) AND NOT EXISTS (SELECT 1 FROM categorylinks WHERE cl_from = talk.page_id 
                 and (cl_to LIKE "%Biography_articles" 
                      OR cl_to LIKE "%biography_articles" 
                      OR cl_to LIKE "%WikiProject_Biography%"))))

ORDER BY article.page_title

WikiProject to page category cross-reference[edit]

https://quarry.wmcloud.org/query/72090

Part of a plan to assign articles to WikiProjects using Naive Bayes.

Still being worked on. Limited to just a few pages, for testing purposes. Sorting by all three result columns to make output more compressible later.

Now using modulo arithmetic to sub-sample the page table more uniformly. 1/1000 of the table ~= 6700 articles

page_id % 1000 = 77: Executed in 251.52 seconds as of Tue, 07 Mar 2023 08:21:01 UTC. Resultset (296041 rows) 
page_id % 1000 = 78: Executed in 265.67 seconds as of Tue, 07 Mar 2023 08:28:22 UTC. Resultset (316077 rows) 
page_id % 1000 = 392: Executed in 256.58 seconds as of Tue, 07 Mar 2023 09:22:56 UTC. Resultset (310016 rows) 
page_id % 1000 = 816: Executed in 269.80 seconds as of Tue, 07 Mar 2023 09:31:37 UTC. Resultset (295323 rows) 

With 1/1000 scan, jsonl file is about 30 Mbytes in size, but bz2's to about 1.3 Mbytes.

Running this query repeatedly to cover all articles will take ~3 days., but is necessary to generate ground truth for later categorization.

Next step: Toolforge.

SELECT
  pagecats.cl_to,
  talkcats.cl_to,
  -- REGEXP_EXTRACT(talkcats.cl_to, '(?WikiProject_|.*_importance_|.*_priority_|Unassessed)(.*)'), -- does not work
  COUNT(*) AS my_count
FROM
  page AS article
  INNER JOIN page AS talk ON talk.page_title = article.page_title
  INNER JOIN categorylinks AS pagecats ON pagecats.cl_from = article.page_id
  INNER JOIN categorylinks AS talkcats ON talkcats.cl_from = talk.page_id
WHERE
  article.page_id % 1000 = 816
  AND article.page_namespace = 0
  AND talk.page_namespace = 1
  AND article.page_is_redirect = 0
  AND talk.page_is_redirect = 0
  AND (
        talkcats.cl_to LIKE "%WikiProject_%"
        OR talkcats.cl_to LIKE "%-Class_%_articles"
        OR talkcats.cl_to LIKE "%-importance_%_articles"
        OR talkcats.cl_to LIKE "%-priority_%_articles"
        OR talkcats.cl_to LIKE "Unassessed_%_articles"
      )  
  AND NOT (
            talkcats.cl_to LIKE "%vital%"
            OR talkcats.cl_to LIKE "%Version%"
  )
  AND NOT (pagecats.cl_to LIKE "%Disambig%" OR pagecats.cl_to LIKE "%disambig%" OR pagecats.cl_to LIKE "%set_index%" OR pagecats.cl_to LIKE "Set_index%")
  AND NOT (talkcats.cl_to LIKE "%Disambig%" OR talkcats.cl_to LIKE "%disambig%")
  AND NOT (pagecats.cl_to LIKE "Short_description%" 
           OR pagecats.cl_to LIKE "%_errors%"
           OR pagecats.cl_to LIKE "CS1_%"           
           OR pagecats.cl_to LIKE "%short_description%"
           OR pagecats.cl_to LIKE "%articles%"
           OR pagecats.cl_to LIKE "Articles%"
           OR pagecats.cl_to LIKE "%pages%"
           OR pagecats.cl_to LIKE "%disputes%"
           OR pagecats.cl_to LIKE "Pages%"
           OR pagecats.cl_to LIKE "Use_dmy_date%"
           OR pagecats.cl_to LIKE "%Wikipedia%"
           OR pagecats.cl_to LIKE "%articles%"
           OR pagecats.cl_to LIKE "%Articles%"
           OR pagecats.cl_to LIKE "%Wikidata%"
           OR pagecats.cl_to LIKE "Webarchive%")
GROUP BY  pagecats.cl_to, talkcats.cl_to
ORDER BY my_count DESC, pagecats.cl_to, talkcats.cl_to
-- LIMIT 10000